The course project is based on the Home Credit Default Risk (HCDR) Kaggle competition. The goal of this project is to predict whether a client will repay a loan. To make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict its clients' repayment abilities.
Kaggle is a data science competition platform that hosts a large number of datasets. In the past, submitting your results was troublesome: you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line, e.g.,
! kaggle competitions files home-credit-default-risk
It is quite easy to set up; it took me less than 15 minutes to finish a submission.
Download your API token (the kaggle.json file) from your Kaggle account page and place it in the right place (~/.kaggle/). For more detailed information on setting up the Kaggle API, see the official documentation.
# !pip install kaggle
!pwd
/geode2/home/u060/nmalpani/Carbonate/miniconda3/bin
!mkdir ~/.kaggle
!cp /root/shared/Downloads/kaggle.json ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json
mkdir: cannot create directory ‘/N/u/nmalpani/Carbonate/.kaggle’: File exists
cp: cannot stat ‘/root/shared/Downloads/kaggle.json’: Permission denied
chmod: cannot access ‘/N/u/nmalpani/Carbonate/.kaggle/kaggle.json’: No such file or directory
! kaggle competitions files home-credit-default-risk
/bin/bash: kaggle: command not found
Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.
While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.
Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.
The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either be unable to obtain loans or become victims of untrustworthy lenders.
The Home Credit group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19 May 2018).
There are 7 different sources of data:

- application_{train|test}.csv: the main tables, one row per loan application (the TARGET label appears only in the training set)
- bureau.csv: clients' previous credits from other financial institutions, as reported to the credit bureau
- bureau_balance.csv: monthly balances of the previous credits in bureau.csv
- previous_application.csv: clients' previous loan applications at Home Credit
- POS_CASH_balance.csv: monthly snapshots of previous point-of-sale and cash loans at Home Credit
- credit_card_balance.csv: monthly snapshots of previous credit cards at Home Credit
- installments_payments.csv: repayment history for previously disbursed Home Credit loans
Create a base directory:
DATA_DIR = "../../../Data/home-credit-default-risk" #same level as course repo in the data directory
Please download the project data files and data dictionary and unzip them using either of the following approaches:

- Use the Download button on the competition's Data webpage and unzip the archive into DATA_DIR, or
- Use the Kaggle CLI (see the commented-out download command below).
!mkdir $DATA_DIR
mkdir: cannot create directory ‘Data/home-credit-default-risk’: File exists
!ls -l $DATA_DIR
total 2622080
-rw-rw-r-- 1 nmalpani nmalpani  26567651 Dec 11  2019 application_test.csv
-rw-rw-r-- 1 nmalpani nmalpani 166133370 Dec 11  2019 application_train.csv
-rw-rw-r-- 1 nmalpani nmalpani 375592889 Dec 11  2019 bureau_balance.csv
-rw-rw-r-- 1 nmalpani nmalpani 170016717 Dec 11  2019 bureau.csv
-rw-rw-r-- 1 nmalpani nmalpani 424582605 Dec 11  2019 credit_card_balance.csv
-rw-rw-r-- 1 nmalpani nmalpani     37383 Dec 11  2019 HomeCredit_columns_description.csv
-rw-rw-r-- 1 nmalpani nmalpani 723118349 Dec 11  2019 installments_payments.csv
-rw-rw-r-- 1 nmalpani nmalpani 392703158 Dec 11  2019 POS_CASH_balance.csv
-rw-rw-r-- 1 nmalpani nmalpani 404973293 Dec 11  2019 previous_application.csv
-rw-rw-r-- 1 nmalpani nmalpani    536202 Dec 11  2019 sample_submission.csv
# ! kaggle competitions download home-credit-default-risk -p $DATA_DIR
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import os
import zipfile
from sklearn.base import BaseEstimator, TransformerMixin
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
unzippingReq = False  # set to True if the CSVs still need to be extracted
if unzippingReq:  # please modify this code to match your archive locations
    archives = [
        'application_train.csv.zip',
        'application_test.csv.zip',
        'bureau_balance.csv.zip',
        'bureau.csv.zip',
        'credit_card_balance.csv.zip',
        'installments_payments.csv.zip',
        'POS_CASH_balance.csv.zip',
        'previous_application.csv.zip',
    ]
    for archive in archives:
        # 'with' closes each archive even if extraction fails
        with zipfile.ZipFile(archive, 'r') as zip_ref:
            zip_ref.extractall('datasets')
def load_data(in_path, name):
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")
    print(df.info())
    display(df.head(5))
    return df
datasets = {}  # let's store the datasets in a dictionary so we can keep track of them easily
ds_name = 'application_train'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
datasets['application_train'].shape
application_train: shape is (307511, 122) <class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB None
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
(307511, 122)
ds_name = 'application_test'
datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_test: shape is (48744, 121) <class 'pandas.core.frame.DataFrame'> RangeIndex: 48744 entries, 0 to 48743 Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(40), object(16) memory usage: 45.0+ MB None
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
The application dataset has the most information about the client: Gender, income, family status, education ...
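Before modeling, it is worth checking how the TARGET label is distributed, since defaults are rare in this competition. A minimal sketch, using a toy stand-in frame rather than the full dataset (on the real `datasets["application_train"]` the same two lines apply unchanged):

```python
import pandas as pd

# Toy stand-in for datasets["application_train"]; the real frame has
# 307,511 rows and a binary TARGET column (1 = repayment difficulty).
toy_train = pd.DataFrame({
    "SK_ID_CURR": [100002, 100003, 100004, 100006, 100007],
    "TARGET":     [1, 0, 0, 0, 0],
})

# Fraction of each class. On the real data TARGET=1 is roughly 8%,
# so plain accuracy would be misleading; the competition uses ROC AUC.
class_balance = toy_train["TARGET"].value_counts(normalize=True)
print(class_balance)
```

The strong imbalance is the main reason ROC AUC, rather than accuracy, is the competition metric.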
%%time
ds_names = ("application_train", "application_test", "bureau", "bureau_balance",
            "credit_card_balance", "installments_payments",
            "previous_application", "POS_CASH_balance")
for ds_name in ds_names:
    datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)
application_train: shape is (307511, 122) <class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB None
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 122 columns
application_test: shape is (48744, 121) <class 'pandas.core.frame.DataFrame'> RangeIndex: 48744 entries, 0 to 48743 Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(40), object(16) memory usage: 45.0+ MB None
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 121 columns
bureau: shape is (1716428, 17) <class 'pandas.core.frame.DataFrame'> RangeIndex: 1716428 entries, 0 to 1716427 Data columns (total 17 columns): # Column Dtype --- ------ ----- 0 SK_ID_CURR int64 1 SK_ID_BUREAU int64 2 CREDIT_ACTIVE object 3 CREDIT_CURRENCY object 4 DAYS_CREDIT int64 5 CREDIT_DAY_OVERDUE int64 6 DAYS_CREDIT_ENDDATE float64 7 DAYS_ENDDATE_FACT float64 8 AMT_CREDIT_MAX_OVERDUE float64 9 CNT_CREDIT_PROLONG int64 10 AMT_CREDIT_SUM float64 11 AMT_CREDIT_SUM_DEBT float64 12 AMT_CREDIT_SUM_LIMIT float64 13 AMT_CREDIT_SUM_OVERDUE float64 14 CREDIT_TYPE object 15 DAYS_CREDIT_UPDATE int64 16 AMT_ANNUITY float64 dtypes: float64(8), int64(6), object(3) memory usage: 222.6+ MB None
| SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
bureau_balance: shape is (27299925, 3) <class 'pandas.core.frame.DataFrame'> RangeIndex: 27299925 entries, 0 to 27299924 Data columns (total 3 columns): # Column Dtype --- ------ ----- 0 SK_ID_BUREAU int64 1 MONTHS_BALANCE int64 2 STATUS object dtypes: int64(2), object(1) memory usage: 624.8+ MB None
| SK_ID_BUREAU | MONTHS_BALANCE | STATUS | |
|---|---|---|---|
| 0 | 5715448 | 0 | C |
| 1 | 5715448 | -1 | C |
| 2 | 5715448 | -2 | C |
| 3 | 5715448 | -3 | C |
| 4 | 5715448 | -4 | C |
credit_card_balance: shape is (3840312, 23) <class 'pandas.core.frame.DataFrame'> RangeIndex: 3840312 entries, 0 to 3840311 Data columns (total 23 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 MONTHS_BALANCE int64 3 AMT_BALANCE float64 4 AMT_CREDIT_LIMIT_ACTUAL int64 5 AMT_DRAWINGS_ATM_CURRENT float64 6 AMT_DRAWINGS_CURRENT float64 7 AMT_DRAWINGS_OTHER_CURRENT float64 8 AMT_DRAWINGS_POS_CURRENT float64 9 AMT_INST_MIN_REGULARITY float64 10 AMT_PAYMENT_CURRENT float64 11 AMT_PAYMENT_TOTAL_CURRENT float64 12 AMT_RECEIVABLE_PRINCIPAL float64 13 AMT_RECIVABLE float64 14 AMT_TOTAL_RECEIVABLE float64 15 CNT_DRAWINGS_ATM_CURRENT float64 16 CNT_DRAWINGS_CURRENT int64 17 CNT_DRAWINGS_OTHER_CURRENT float64 18 CNT_DRAWINGS_POS_CURRENT float64 19 CNT_INSTALMENT_MATURE_CUM float64 20 NAME_CONTRACT_STATUS object 21 SK_DPD int64 22 SK_DPD_DEF int64 dtypes: float64(15), int64(7), object(1) memory usage: 673.9+ MB None
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | ... | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | ... | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | ... | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | ... | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | ... | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
5 rows × 23 columns
installments_payments: shape is (13605401, 8) <class 'pandas.core.frame.DataFrame'> RangeIndex: 13605401 entries, 0 to 13605400 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 NUM_INSTALMENT_VERSION float64 3 NUM_INSTALMENT_NUMBER int64 4 DAYS_INSTALMENT float64 5 DAYS_ENTRY_PAYMENT float64 6 AMT_INSTALMENT float64 7 AMT_PAYMENT float64 dtypes: float64(5), int64(3) memory usage: 830.4 MB None
| SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.360 | 6948.360 |
| 1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525 | 1716.525 |
| 2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.000 | 25425.000 |
| 3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.130 | 24350.130 |
| 4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.040 | 2160.585 |
previous_application: shape is (1670214, 37) <class 'pandas.core.frame.DataFrame'> RangeIndex: 1670214 entries, 0 to 1670213 Data columns (total 37 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SK_ID_PREV 1670214 non-null int64 1 SK_ID_CURR 1670214 non-null int64 2 NAME_CONTRACT_TYPE 1670214 non-null object 3 AMT_ANNUITY 1297979 non-null float64 4 AMT_APPLICATION 1670214 non-null float64 5 AMT_CREDIT 1670213 non-null float64 6 AMT_DOWN_PAYMENT 774370 non-null float64 7 AMT_GOODS_PRICE 1284699 non-null float64 8 WEEKDAY_APPR_PROCESS_START 1670214 non-null object 9 HOUR_APPR_PROCESS_START 1670214 non-null int64 10 FLAG_LAST_APPL_PER_CONTRACT 1670214 non-null object 11 NFLAG_LAST_APPL_IN_DAY 1670214 non-null int64 12 RATE_DOWN_PAYMENT 774370 non-null float64 13 RATE_INTEREST_PRIMARY 5951 non-null float64 14 RATE_INTEREST_PRIVILEGED 5951 non-null float64 15 NAME_CASH_LOAN_PURPOSE 1670214 non-null object 16 NAME_CONTRACT_STATUS 1670214 non-null object 17 DAYS_DECISION 1670214 non-null int64 18 NAME_PAYMENT_TYPE 1670214 non-null object 19 CODE_REJECT_REASON 1670214 non-null object 20 NAME_TYPE_SUITE 849809 non-null object 21 NAME_CLIENT_TYPE 1670214 non-null object 22 NAME_GOODS_CATEGORY 1670214 non-null object 23 NAME_PORTFOLIO 1670214 non-null object 24 NAME_PRODUCT_TYPE 1670214 non-null object 25 CHANNEL_TYPE 1670214 non-null object 26 SELLERPLACE_AREA 1670214 non-null int64 27 NAME_SELLER_INDUSTRY 1670214 non-null object 28 CNT_PAYMENT 1297984 non-null float64 29 NAME_YIELD_GROUP 1670214 non-null object 30 PRODUCT_COMBINATION 1669868 non-null object 31 DAYS_FIRST_DRAWING 997149 non-null float64 32 DAYS_FIRST_DUE 997149 non-null float64 33 DAYS_LAST_DUE_1ST_VERSION 997149 non-null float64 34 DAYS_LAST_DUE 997149 non-null float64 35 DAYS_TERMINATION 997149 non-null float64 36 NFLAG_INSURED_ON_APPROVAL 997149 non-null float64 dtypes: float64(15), int64(6), object(16) memory usage: 471.5+ MB None
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 37 columns
POS_CASH_balance: shape is (10001358, 8) <class 'pandas.core.frame.DataFrame'> RangeIndex: 10001358 entries, 0 to 10001357 Data columns (total 8 columns): # Column Dtype --- ------ ----- 0 SK_ID_PREV int64 1 SK_ID_CURR int64 2 MONTHS_BALANCE int64 3 CNT_INSTALMENT float64 4 CNT_INSTALMENT_FUTURE float64 5 NAME_CONTRACT_STATUS object 6 SK_DPD int64 7 SK_DPD_DEF int64 dtypes: float64(2), int64(5), object(1) memory usage: 610.4+ MB None
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1803195 | 182943 | -31 | 48.0 | 45.0 | Active | 0 | 0 |
| 1 | 1715348 | 367990 | -33 | 36.0 | 35.0 | Active | 0 | 0 |
| 2 | 1784872 | 397406 | -32 | 12.0 | 9.0 | Active | 0 | 0 |
| 3 | 1903291 | 269225 | -35 | 48.0 | 42.0 | Active | 0 | 0 |
| 4 | 2341044 | 334279 | -35 | 36.0 | 35.0 | Active | 0 | 0 |
CPU times: user 33.8 s, sys: 15.5 s, total: 49.2 s Wall time: 49.3 s
for ds_name in datasets.keys():
    print(f'dataset {ds_name:24}: [ {datasets[ds_name].shape[0]:10,}, {datasets[ds_name].shape[1]}]')
dataset application_train       : [    307,511, 122]
dataset application_test        : [     48,744, 121]
dataset bureau                  : [  1,716,428, 17]
dataset bureau_balance          : [ 27,299,925, 3]
dataset credit_card_balance     : [  3,840,312, 23]
dataset installments_payments   : [ 13,605,401, 8]
dataset previous_application    : [  1,670,214, 37]
dataset POS_CASH_balance        : [ 10,001,358, 8]
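The tables link through shared keys: the application tables join to bureau and previous_application on SK_ID_CURR, bureau_balance joins to bureau on SK_ID_BUREAU, and the monthly-balance tables join to previous_application on SK_ID_PREV. A minimal sketch of the usual pattern (aggregate a one-to-many table to one row per client, then left-join), using toy frames with invented IDs and an illustrative aggregation, not the project's actual feature set:

```python
import pandas as pd

# Toy application table: one row per current loan application
app = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]})

# Toy bureau table: many previous credits per client
bureau = pd.DataFrame({"SK_ID_CURR":     [1, 1, 2],
                       "SK_ID_BUREAU":   [10, 11, 12],
                       "AMT_CREDIT_SUM": [1000.0, 500.0, 250.0]})

# Collapse the one-to-many table to one row per SK_ID_CURR
# (column names BUREAU_* are made up for this example)
bureau_agg = (bureau.groupby("SK_ID_CURR")["AMT_CREDIT_SUM"]
                    .agg(["count", "sum"])
                    .rename(columns={"count": "BUREAU_LOAN_COUNT",
                                     "sum":   "BUREAU_CREDIT_TOTAL"})
                    .reset_index())

# Left join keeps every application row, even clients with no bureau history
merged = app.merge(bureau_agg, on="SK_ID_CURR", how="left")
print(merged)
```

The same aggregate-then-join step applies to the SK_ID_PREV tables (POS_CASH_balance, credit_card_balance, installments_payments) via previous_application.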
datasets["application_train"].info(verbose=True)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Data columns (total 122 columns): # Column Dtype --- ------ ----- 0 SK_ID_CURR int64 1 TARGET int64 2 NAME_CONTRACT_TYPE object 3 CODE_GENDER object 4 FLAG_OWN_CAR object 5 FLAG_OWN_REALTY object 6 CNT_CHILDREN int64 7 AMT_INCOME_TOTAL float64 8 AMT_CREDIT float64 9 AMT_ANNUITY float64 10 AMT_GOODS_PRICE float64 11 NAME_TYPE_SUITE object 12 NAME_INCOME_TYPE object 13 NAME_EDUCATION_TYPE object 14 NAME_FAMILY_STATUS object 15 NAME_HOUSING_TYPE object 16 REGION_POPULATION_RELATIVE float64 17 DAYS_BIRTH int64 18 DAYS_EMPLOYED int64 19 DAYS_REGISTRATION float64 20 DAYS_ID_PUBLISH int64 21 OWN_CAR_AGE float64 22 FLAG_MOBIL int64 23 FLAG_EMP_PHONE int64 24 FLAG_WORK_PHONE int64 25 FLAG_CONT_MOBILE int64 26 FLAG_PHONE int64 27 FLAG_EMAIL int64 28 OCCUPATION_TYPE object 29 CNT_FAM_MEMBERS float64 30 REGION_RATING_CLIENT int64 31 REGION_RATING_CLIENT_W_CITY int64 32 WEEKDAY_APPR_PROCESS_START object 33 HOUR_APPR_PROCESS_START int64 34 REG_REGION_NOT_LIVE_REGION int64 35 REG_REGION_NOT_WORK_REGION int64 36 LIVE_REGION_NOT_WORK_REGION int64 37 REG_CITY_NOT_LIVE_CITY int64 38 REG_CITY_NOT_WORK_CITY int64 39 LIVE_CITY_NOT_WORK_CITY int64 40 ORGANIZATION_TYPE object 41 EXT_SOURCE_1 float64 42 EXT_SOURCE_2 float64 43 EXT_SOURCE_3 float64 44 APARTMENTS_AVG float64 45 BASEMENTAREA_AVG float64 46 YEARS_BEGINEXPLUATATION_AVG float64 47 YEARS_BUILD_AVG float64 48 COMMONAREA_AVG float64 49 ELEVATORS_AVG float64 50 ENTRANCES_AVG float64 51 FLOORSMAX_AVG float64 52 FLOORSMIN_AVG float64 53 LANDAREA_AVG float64 54 LIVINGAPARTMENTS_AVG float64 55 LIVINGAREA_AVG float64 56 NONLIVINGAPARTMENTS_AVG float64 57 NONLIVINGAREA_AVG float64 58 APARTMENTS_MODE float64 59 BASEMENTAREA_MODE float64 60 YEARS_BEGINEXPLUATATION_MODE float64 61 YEARS_BUILD_MODE float64 62 COMMONAREA_MODE float64 63 ELEVATORS_MODE float64 64 ENTRANCES_MODE float64 65 FLOORSMAX_MODE float64 66 FLOORSMIN_MODE float64 67 LANDAREA_MODE float64 68 
LIVINGAPARTMENTS_MODE float64 69 LIVINGAREA_MODE float64 70 NONLIVINGAPARTMENTS_MODE float64 71 NONLIVINGAREA_MODE float64 72 APARTMENTS_MEDI float64 73 BASEMENTAREA_MEDI float64 74 YEARS_BEGINEXPLUATATION_MEDI float64 75 YEARS_BUILD_MEDI float64 76 COMMONAREA_MEDI float64 77 ELEVATORS_MEDI float64 78 ENTRANCES_MEDI float64 79 FLOORSMAX_MEDI float64 80 FLOORSMIN_MEDI float64 81 LANDAREA_MEDI float64 82 LIVINGAPARTMENTS_MEDI float64 83 LIVINGAREA_MEDI float64 84 NONLIVINGAPARTMENTS_MEDI float64 85 NONLIVINGAREA_MEDI float64 86 FONDKAPREMONT_MODE object 87 HOUSETYPE_MODE object 88 TOTALAREA_MODE float64 89 WALLSMATERIAL_MODE object 90 EMERGENCYSTATE_MODE object 91 OBS_30_CNT_SOCIAL_CIRCLE float64 92 DEF_30_CNT_SOCIAL_CIRCLE float64 93 OBS_60_CNT_SOCIAL_CIRCLE float64 94 DEF_60_CNT_SOCIAL_CIRCLE float64 95 DAYS_LAST_PHONE_CHANGE float64 96 FLAG_DOCUMENT_2 int64 97 FLAG_DOCUMENT_3 int64 98 FLAG_DOCUMENT_4 int64 99 FLAG_DOCUMENT_5 int64 100 FLAG_DOCUMENT_6 int64 101 FLAG_DOCUMENT_7 int64 102 FLAG_DOCUMENT_8 int64 103 FLAG_DOCUMENT_9 int64 104 FLAG_DOCUMENT_10 int64 105 FLAG_DOCUMENT_11 int64 106 FLAG_DOCUMENT_12 int64 107 FLAG_DOCUMENT_13 int64 108 FLAG_DOCUMENT_14 int64 109 FLAG_DOCUMENT_15 int64 110 FLAG_DOCUMENT_16 int64 111 FLAG_DOCUMENT_17 int64 112 FLAG_DOCUMENT_18 int64 113 FLAG_DOCUMENT_19 int64 114 FLAG_DOCUMENT_20 int64 115 FLAG_DOCUMENT_21 int64 116 AMT_REQ_CREDIT_BUREAU_HOUR float64 117 AMT_REQ_CREDIT_BUREAU_DAY float64 118 AMT_REQ_CREDIT_BUREAU_WEEK float64 119 AMT_REQ_CREDIT_BUREAU_MON float64 120 AMT_REQ_CREDIT_BUREAU_QRT float64 121 AMT_REQ_CREDIT_BUREAU_YEAR float64 dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB
datasets["application_train"].shape
(307511, 122)
datasets["application_train"].describe()  # numerical features only
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | 3.072330e+05 | 307511.000000 | 307511.000000 | 307511.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| mean | 278180.518577 | 0.080729 | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | 5.383962e+05 | 0.020868 | -16036.995067 | 63815.045904 | ... | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | 3.694465e+05 | 0.013831 | 4363.988632 | 141275.766519 | ... | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | 4.050000e+04 | 0.000290 | -25229.000000 | -17912.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | 2.385000e+05 | 0.010006 | -19682.000000 | -2760.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | 4.500000e+05 | 0.018850 | -15750.000000 | -1213.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | 6.795000e+05 | 0.028663 | -12413.000000 | -289.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | 4.050000e+06 | 0.072508 | -7489.000000 | 365243.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
8 rows × 106 columns
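One anomaly stands out in the describe() output above: DAYS_EMPLOYED is recorded relative to the application date, so valid values are negative, yet its maximum is 365243 (about 1000 years), clearly a placeholder value. A minimal cleanup sketch on a toy column (the column name suffix `_ANOM` is my own choice, not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the DAYS_EMPLOYED column; 365243 is the sentinel
# seen as the max in the describe() table above
toy = pd.DataFrame({"DAYS_EMPLOYED": [-2760, -1213, 365243, -289, 365243]})

# Keep a flag for the anomalous rows, then replace the sentinel with NaN
# so it does not distort means and scalers; an imputer can fill it later
toy["DAYS_EMPLOYED_ANOM"] = toy["DAYS_EMPLOYED"] == 365243
toy["DAYS_EMPLOYED"] = toy["DAYS_EMPLOYED"].replace({365243: np.nan})
print(toy)
```

Flagging before replacing preserves the information that the value was missing, which may itself be predictive.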
datasets["application_test"].describe()  # numerical features only
| SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48744.000000 | 48744.000000 | 4.874400e+04 | 4.874400e+04 | 48720.000000 | 4.874400e+04 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | ... | 48744.000000 | 48744.0 | 48744.0 | 48744.0 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 | 42695.000000 |
| mean | 277796.676350 | 0.397054 | 1.784318e+05 | 5.167404e+05 | 29426.240209 | 4.626188e+05 | 0.021226 | -16068.084605 | 67485.366322 | -4967.652716 | ... | 0.001559 | 0.0 | 0.0 | 0.0 | 0.002108 | 0.001803 | 0.002787 | 0.009299 | 0.546902 | 1.983769 |
| std | 103169.547296 | 0.709047 | 1.015226e+05 | 3.653970e+05 | 16016.368315 | 3.367102e+05 | 0.014428 | 4325.900393 | 144348.507136 | 3552.612035 | ... | 0.039456 | 0.0 | 0.0 | 0.0 | 0.046373 | 0.046132 | 0.054037 | 0.110924 | 0.693305 | 1.838873 |
| min | 100001.000000 | 0.000000 | 2.694150e+04 | 4.500000e+04 | 2295.000000 | 4.500000e+04 | 0.000253 | -25195.000000 | -17463.000000 | -23722.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 188557.750000 | 0.000000 | 1.125000e+05 | 2.606400e+05 | 17973.000000 | 2.250000e+05 | 0.010006 | -19637.000000 | -2910.000000 | -7459.250000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 277549.000000 | 0.000000 | 1.575000e+05 | 4.500000e+05 | 26199.000000 | 3.960000e+05 | 0.018850 | -15785.000000 | -1293.000000 | -4490.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 |
| 75% | 367555.500000 | 1.000000 | 2.250000e+05 | 6.750000e+05 | 37390.500000 | 6.300000e+05 | 0.028663 | -12496.000000 | -296.000000 | -1901.000000 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 3.000000 |
| max | 456250.000000 | 20.000000 | 4.410000e+06 | 2.245500e+06 | 180576.000000 | 2.245500e+06 | 0.072508 | -7338.000000 | 365243.000000 | 0.000000 | ... | 1.000000 | 0.0 | 0.0 | 0.0 | 2.000000 | 2.000000 | 2.000000 | 6.000000 | 7.000000 | 17.000000 |
8 rows × 105 columns
datasets["application_test"].shape
(48744, 121)
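The test set has 121 columns versus the train set's 122; the difference should be just the TARGET label. One quick way to verify is a column-set difference. A minimal, self-contained sketch, with toy frames standing in for the real `datasets["application_train"]` / `datasets["application_test"]`:

```python
import pandas as pd

# Toy stand-ins for the real application_train / application_test frames
train = pd.DataFrame({"SK_ID_CURR": [1, 2], "AMT_CREDIT": [1e5, 2e5], "TARGET": [0, 1]})
test = pd.DataFrame({"SK_ID_CURR": [3], "AMT_CREDIT": [3e5]})

# Columns present in train but absent from test
extra = set(train.columns) - set(test.columns)
print(extra)  # {'TARGET'}
```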
datasets["application_train"].describe(include='all')  # include both categorical and numeric features
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 307511.000000 | 307511.000000 | 307511 | 307511 | 307511 | 307511 | 307511.000000 | 3.075110e+05 | 3.075110e+05 | 307499.000000 | ... | 307511.000000 | 307511.000000 | 307511.000000 | 307511.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 | 265992.000000 |
| unique | NaN | NaN | 2 | 3 | 2 | 2 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| top | NaN | NaN | Cash loans | F | N | Y | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | 278232 | 202448 | 202924 | 213312 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| mean | 278180.518577 | 0.080729 | NaN | NaN | NaN | NaN | 0.417052 | 1.687979e+05 | 5.990260e+05 | 27108.573909 | ... | 0.008130 | 0.000595 | 0.000507 | 0.000335 | 0.006402 | 0.007000 | 0.034362 | 0.267395 | 0.265474 | 1.899974 |
| std | 102790.175348 | 0.272419 | NaN | NaN | NaN | NaN | 0.722121 | 2.371231e+05 | 4.024908e+05 | 14493.737315 | ... | 0.089798 | 0.024387 | 0.022518 | 0.018299 | 0.083849 | 0.110757 | 0.204685 | 0.916002 | 0.794056 | 1.869295 |
| min | 100002.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 2.565000e+04 | 4.500000e+04 | 1615.500000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 189145.500000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 1.125000e+05 | 2.700000e+05 | 16524.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 278202.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.000000 | 1.471500e+05 | 5.135310e+05 | 24903.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 367142.500000 | 0.000000 | NaN | NaN | NaN | NaN | 1.000000 | 2.025000e+05 | 8.086500e+05 | 34596.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 3.000000 |
| max | 456255.000000 | 1.000000 | NaN | NaN | NaN | NaN | 19.000000 | 1.170000e+08 | 4.050000e+06 | 258025.500000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 4.000000 | 9.000000 | 8.000000 | 27.000000 | 261.000000 | 25.000000 |
11 rows × 122 columns
datasets['application_train'].corrwith(datasets['application_train']["TARGET"]).sort_values(ascending = False)
TARGET 1.000000
DAYS_BIRTH 0.078239
REGION_RATING_CLIENT_W_CITY 0.060893
REGION_RATING_CLIENT 0.058899
DAYS_LAST_PHONE_CHANGE 0.055218
...
FLOORSMAX_AVG -0.044003
DAYS_EMPLOYED -0.044932
EXT_SOURCE_1 -0.155317
EXT_SOURCE_2 -0.160472
EXT_SOURCE_3 -0.178919
Length: 106, dtype: float64
As seen above, we computed the correlation of each independent feature with the dependent (target) variable to gauge feature relevance. Very few columns are meaningfully correlated with the target, which makes prediction all the more difficult. Let's see how we proceed from here!
percent = (datasets["application_train"].isnull().sum()/datasets["application_train"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_train"].isna().sum().sort_values(ascending = False)
missing_application_train_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
missing_application_train_data.head(20)
| Percent | Train Missing Count | |
|---|---|---|
| COMMONAREA_MEDI | 69.87 | 214865 |
| COMMONAREA_AVG | 69.87 | 214865 |
| COMMONAREA_MODE | 69.87 | 214865 |
| NONLIVINGAPARTMENTS_MODE | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_AVG | 69.43 | 213514 |
| NONLIVINGAPARTMENTS_MEDI | 69.43 | 213514 |
| FONDKAPREMONT_MODE | 68.39 | 210295 |
| LIVINGAPARTMENTS_MODE | 68.35 | 210199 |
| LIVINGAPARTMENTS_AVG | 68.35 | 210199 |
| LIVINGAPARTMENTS_MEDI | 68.35 | 210199 |
| FLOORSMIN_AVG | 67.85 | 208642 |
| FLOORSMIN_MODE | 67.85 | 208642 |
| FLOORSMIN_MEDI | 67.85 | 208642 |
| YEARS_BUILD_MEDI | 66.50 | 204488 |
| YEARS_BUILD_MODE | 66.50 | 204488 |
| YEARS_BUILD_AVG | 66.50 | 204488 |
| OWN_CAR_AGE | 65.99 | 202929 |
| LANDAREA_MEDI | 59.38 | 182590 |
| LANDAREA_MODE | 59.38 | 182590 |
| LANDAREA_AVG | 59.38 | 182590 |
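The percent-and-count pattern above is worth wrapping in a small helper so it can be reused on any frame. A minimal sketch on toy data; `missing_summary` is our own name, not something defined in this notebook:

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Percent and count of missing values per column, worst first."""
    percent = (df.isnull().mean() * 100).round(2)   # fraction missing -> percent
    count = df.isnull().sum()
    out = pd.concat([percent, count], axis=1, keys=["Percent", "Missing Count"])
    return out.sort_values("Percent", ascending=False)

toy = pd.DataFrame({
    "a": [1, np.nan, np.nan, np.nan],  # 75% missing
    "b": [1, 2, np.nan, 4],            # 25% missing
    "c": [1, 2, 3, 4],                 # complete
})
print(missing_summary(toy))
```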
datasets['application_train'].isnull().sum()
SK_ID_CURR 0
TARGET 0
NAME_CONTRACT_TYPE 0
CODE_GENDER 0
FLAG_OWN_CAR 0
...
AMT_REQ_CREDIT_BUREAU_DAY 41519
AMT_REQ_CREDIT_BUREAU_WEEK 41519
AMT_REQ_CREDIT_BUREAU_MON 41519
AMT_REQ_CREDIT_BUREAU_QRT 41519
AMT_REQ_CREDIT_BUREAU_YEAR 41519
Length: 122, dtype: int64
Missingno is a Python library for understanding the distribution of missing values through informative visualizations, such as heat maps and bar charts. With it, we can see where missing values occur and inspect the nullity correlation between the columns that contain them.
df = datasets['application_train'].copy()
import missingno as msno
# msno.matrix(df)
msno.bar(datasets['application_train'])
<AxesSubplot:>
The bar chart above shows how complete each column is: the height of each bar is the count of non-null values in that column.
A correlation matrix examines the relationships between multiple variables at once: it is a table containing the correlation coefficient between each pair of variables. Each coefficient shows both the strength of the relationship and its direction (positive or negative). In Python, a correlation matrix can be computed with the Pandas and NumPy packages, for instance.
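Before applying this to the HCDR columns, here is a minimal, self-contained example of `DataFrame.corr()` on toy data (the column names are illustrative only):

```python
import pandas as pd

# x and y move together perfectly; z moves exactly opposite to x
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]})
corr = toy.corr()  # pairwise Pearson correlation matrix
print(corr.round(2))
```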
missing_frac = datasets['application_train'].isnull().mean()
appp_50 = datasets['application_train'][missing_frac[missing_frac > 0.5].index]  # columns with >50% missing
plt.figure(figsize=(40,40))
sns.heatmap(appp_50.corr(),annot=True,cmap='BuPu')
<AxesSubplot:>
datasets['application_train'] = datasets['application_train'].drop(
    ['TOTALAREA_MODE', 'ELEVATORS_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG',
     'APARTMENTS_MODE', 'ELEVATORS_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE',
     'APARTMENTS_MEDI', 'ELEVATORS_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI'],
    axis=1)
missing_frac = datasets['application_train'].isnull().mean()
appp_50A = datasets['application_train'][missing_frac[missing_frac > 0.5].index]  # columns still >50% missing
plt.figure(figsize=(40,40))
sns.heatmap(appp_50A.corr(),annot=True, cmap = 'BuPu')
<AxesSubplot:>
datasets['application_train'] = datasets['application_train'].drop(
    ['NONLIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'FLOORSMIN_MEDI',
     'FLOORSMAX_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE',
     'BASEMENTAREA_MEDI', 'BASEMENTAREA_MODE', 'YEARS_BUILD_MEDI',
     'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'YEARS_BUILD_MODE',
     'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BEGINEXPLUATATION_MODE',
     'COMMONAREA_MEDI', 'COMMONAREA_MODE', 'ENTRANCES_MODE',
     'LANDAREA_MODE', 'LANDAREA_MEDI', 'DEF_30_CNT_SOCIAL_CIRCLE',
     'OBS_30_CNT_SOCIAL_CIRCLE', 'REG_CITY_NOT_LIVE_CITY', 'REGION_RATING_CLIENT',
     'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'FLOORSMIN_AVG', 'DAYS_LAST_PHONE_CHANGE',
     'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'LIVE_CITY_NOT_WORK_CITY',
     'LIVE_REGION_NOT_WORK_REGION', 'AMT_GOODS_PRICE', 'CNT_CHILDREN'],
    axis=1)  # note: the original list contained 'YEARS_BUILD_MEDI' twice; the duplicate is removed here
percent = (datasets["application_train"].isnull().sum()/datasets["application_train"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_train"].isna().sum().sort_values(ascending = False)
missing_application_train_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
missing_application_train_data.head(20)
| Percent | Train Missing Count | |
|---|---|---|
| FONDKAPREMONT_MODE | 68.39 | 210295 |
| OWN_CAR_AGE | 65.99 | 202929 |
| LANDAREA_AVG | 59.38 | 182590 |
| BASEMENTAREA_AVG | 58.52 | 179943 |
| EXT_SOURCE_1 | 56.38 | 173378 |
| WALLSMATERIAL_MODE | 50.84 | 156341 |
| APARTMENTS_AVG | 50.75 | 156061 |
| ENTRANCES_AVG | 50.35 | 154828 |
| HOUSETYPE_MODE | 50.18 | 154297 |
| FLOORSMAX_AVG | 49.76 | 153020 |
| YEARS_BEGINEXPLUATATION_AVG | 48.78 | 150007 |
| EMERGENCYSTATE_MODE | 47.40 | 145755 |
| OCCUPATION_TYPE | 31.35 | 96391 |
| EXT_SOURCE_3 | 19.83 | 60965 |
| AMT_REQ_CREDIT_BUREAU_HOUR | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_DAY | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_WEEK | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_MON | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_QRT | 13.50 | 41519 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 13.50 | 41519 |
datasets['application_train'].shape
(307511, 76)
plt.figure(figsize=(40,40))
sns.heatmap(datasets['application_train'].corr(),annot=True, cmap='BuPu')
<AxesSubplot:>
datasets['application_train'].shape
(307511, 76)
plt.figure(figsize=(10,10))
plt.title("Count of Default", fontweight = 'bold', fontsize = 16)
sns.countplot(x ='TARGET',data=datasets['application_train'], hue='TARGET',palette="Purples")
<AxesSubplot:title={'center':'Count of Default'}, xlabel='TARGET', ylabel='count'>
Clients who repaid their loans outnumber defaulters by roughly ten to one: the classes are heavily imbalanced.
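The imbalance can be quantified with `value_counts(normalize=True)`. A self-contained sketch on a toy TARGET column mirroring the roughly 8% default rate seen above:

```python
import pandas as pd

# Toy TARGET column: 92 repaid (0), 8 defaulted (1)
target = pd.Series([0] * 92 + [1] * 8, name="TARGET")

rates = target.value_counts(normalize=True)  # class proportions
ratio = rates[0] / rates[1]                  # repaid-to-default ratio
print(rates.to_dict(), f"repaid/default ratio = {ratio:.1f}")
```

On the real training set, the same two lines applied to `datasets['application_train']['TARGET']` give the imbalance to report alongside any accuracy metric.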
plt.figure(figsize=(10,8))
plt.title("Suite Type", fontweight = 'bold', fontsize = 16)
sns.countplot(x='NAME_TYPE_SUITE', hue='TARGET', data=datasets['application_train'], palette='Purples')
<AxesSubplot:title={'center':'Suite Type'}, xlabel='NAME_TYPE_SUITE', ylabel='count'>
plt.figure(figsize=(10,8))
plt.title("Distribution of Applicant's Family Members Count", fontweight = 'bold', fontsize = 16)
sns.countplot(x='CNT_FAM_MEMBERS', hue='TARGET', data=datasets['application_train'], palette='Purples')
<AxesSubplot:title={'center':"Distribution of Applicant's Family Members Count"}, xlabel='CNT_FAM_MEMBERS', ylabel='count'>
datasets['application_train'].corrwith(datasets['application_train']["TARGET"]).sort_values(ascending = False)
TARGET                         1.000000
DAYS_BIRTH                     0.078239
REGION_RATING_CLIENT_W_CITY    0.060893
DAYS_ID_PUBLISH                0.051457
REG_CITY_NOT_WORK_CITY         0.050994
FLAG_EMP_PHONE                 0.045982
FLAG_DOCUMENT_3                0.044346
DAYS_REGISTRATION              0.041975
OWN_CAR_AGE                    0.037612
DEF_60_CNT_SOCIAL_CIRCLE       0.031276
FLAG_WORK_PHONE                0.028524
AMT_REQ_CREDIT_BUREAU_YEAR     0.019930
CNT_FAM_MEMBERS                0.009308
OBS_60_CNT_SOCIAL_CIRCLE       0.009022
REG_REGION_NOT_WORK_REGION     0.006942
REG_REGION_NOT_LIVE_REGION     0.005576
FLAG_DOCUMENT_2                0.005417
FLAG_DOCUMENT_21               0.003709
AMT_REQ_CREDIT_BUREAU_DAY      0.002704
AMT_REQ_CREDIT_BUREAU_HOUR     0.000930
AMT_REQ_CREDIT_BUREAU_WEEK     0.000788
FLAG_MOBIL                     0.000534
FLAG_CONT_MOBILE               0.000370
FLAG_DOCUMENT_20               0.000215
FLAG_DOCUMENT_5               -0.000316
FLAG_DOCUMENT_12              -0.000756
FLAG_DOCUMENT_19              -0.001358
FLAG_DOCUMENT_10              -0.001414
FLAG_DOCUMENT_7               -0.001520
FLAG_EMAIL                    -0.001758
AMT_REQ_CREDIT_BUREAU_QRT     -0.002022
SK_ID_CURR                    -0.002108
FLAG_DOCUMENT_4               -0.002672
FLAG_DOCUMENT_17              -0.003378
AMT_INCOME_TOTAL              -0.003982
FLAG_DOCUMENT_11              -0.004229
FLAG_DOCUMENT_9               -0.004352
FLAG_DOCUMENT_15              -0.006536
FLAG_DOCUMENT_18              -0.007952
FLAG_DOCUMENT_8               -0.008040
FLAG_DOCUMENT_14              -0.009464
YEARS_BEGINEXPLUATATION_AVG   -0.009728
LANDAREA_AVG                  -0.010885
FLAG_DOCUMENT_13              -0.011583
FLAG_DOCUMENT_16              -0.011615
AMT_REQ_CREDIT_BUREAU_MON     -0.012462
AMT_ANNUITY                   -0.012817
ENTRANCES_AVG                 -0.019172
BASEMENTAREA_AVG              -0.022746
FLAG_PHONE                    -0.023806
HOUR_APPR_PROCESS_START       -0.024166
FLAG_DOCUMENT_6               -0.028602
APARTMENTS_AVG                -0.029498
AMT_CREDIT                    -0.030369
REGION_POPULATION_RELATIVE    -0.037227
FLOORSMAX_AVG                 -0.044003
DAYS_EMPLOYED                 -0.044932
EXT_SOURCE_1                  -0.155317
EXT_SOURCE_2                  -0.160472
EXT_SOURCE_3                  -0.178919
dtype: float64
As we can see, EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3 are the features most negatively correlated with the target.
target_corr = pd.DataFrame(datasets['application_train'].corrwith(datasets['application_train']["TARGET"]).sort_values(ascending = False)).reset_index()
target_corr = target_corr.rename(columns={'index':'Column_name',0:'corr_with_target'})
target_corr.head()
| Column_name | corr_with_target | |
|---|---|---|
| 0 | TARGET | 1.000000 |
| 1 | DAYS_BIRTH | 0.078239 |
| 2 | REGION_RATING_CLIENT_W_CITY | 0.060893 |
| 3 | DAYS_ID_PUBLISH | 0.051457 |
| 4 | REG_CITY_NOT_WORK_CITY | 0.050994 |
target_corr.tail()
| Column_name | corr_with_target | |
|---|---|---|
| 55 | FLOORSMAX_AVG | -0.044003 |
| 56 | DAYS_EMPLOYED | -0.044932 |
| 57 | EXT_SOURCE_1 | -0.155317 |
| 58 | EXT_SOURCE_2 | -0.160472 |
| 59 | EXT_SOURCE_3 | -0.178919 |
plt.figure(figsize=(10,10))
sns.barplot(data=target_corr,x=target_corr['Column_name'][1:], y= target_corr['corr_with_target'][1:], palette = 'BuPu')
plt.xticks(rotation=90)
plt.title('Correlation of Features with Target Variable', fontweight = 'bold', fontsize = 16)
plt.show()
Above, we plotted the correlation of every feature with the target variable. Roughly 30% of the features show almost no correlation with the target; we drop them, as they are unlikely to influence our predictions of the target variable.
Removing more columns based on their correlation with the target variable.
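Instead of hand-listing columns, one could drop every numeric feature whose absolute correlation with TARGET falls below a cutoff. A minimal sketch on toy data; the `threshold` value and the toy column names (`signal`, `noise`) are illustrative, not taken from this notebook:

```python
import pandas as pd

# Toy frame: "signal" tracks TARGET closely, "noise" is unrelated to it
toy = pd.DataFrame({
    "TARGET": [0, 1, 0, 1, 0, 1, 0, 1],
    "signal": [0.1, 0.9, 0.2, 0.8, 0.0, 1.0, 0.1, 0.9],
    "noise":  [0.4, 0.6, 0.6, 0.4, 0.4, 0.6, 0.6, 0.4],
})

corr = toy.corrwith(toy["TARGET"]).abs()
threshold = 0.05  # illustrative cutoff, not a value used in this notebook
weak = corr[(corr < threshold) & (corr.index != "TARGET")].index
reduced = toy.drop(columns=weak)  # keeps TARGET and the correlated feature
print(sorted(reduced.columns))
```

The same pattern applied to `datasets['application_train']` would automate the manual drops below, at the cost of having to pick the threshold deliberately.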
datasets['application_train'] = datasets['application_train'].drop(['AMT_REQ_CREDIT_BUREAU_HOUR',
'AMT_REQ_CREDIT_BUREAU_WEEK',
'FLAG_MOBIL',
'FLAG_CONT_MOBILE',
'FLAG_DOCUMENT_20',
'FLAG_DOCUMENT_5',
'FLAG_DOCUMENT_12',
'FLAG_DOCUMENT_19',
'FLAG_DOCUMENT_10',
'FLAG_DOCUMENT_7'],axis=1)
target_corr1 = pd.DataFrame(datasets['application_train'].corrwith(datasets['application_train']["TARGET"]).sort_values(ascending = False)).reset_index()
target_corr1 = target_corr1.rename(columns={'index':'Column_name',0:'corr_with_target'})
plt.figure(figsize=(10,10))
sns.barplot(data=target_corr1,x=target_corr1['Column_name'][1:], y= target_corr1['corr_with_target'][1:], palette = 'BuPu')
plt.xticks(rotation=90)
plt.title('Correlation of Features with Target Variable', fontweight = 'bold', fontsize = 16)
plt.show()
datasets['application_train'].corrwith(datasets['application_train']["TARGET"]).sort_values(ascending = False)
TARGET                         1.000000
DAYS_BIRTH                     0.078239
REGION_RATING_CLIENT_W_CITY    0.060893
DAYS_ID_PUBLISH                0.051457
REG_CITY_NOT_WORK_CITY         0.050994
FLAG_EMP_PHONE                 0.045982
FLAG_DOCUMENT_3                0.044346
DAYS_REGISTRATION              0.041975
OWN_CAR_AGE                    0.037612
DEF_60_CNT_SOCIAL_CIRCLE       0.031276
FLAG_WORK_PHONE                0.028524
AMT_REQ_CREDIT_BUREAU_YEAR     0.019930
CNT_FAM_MEMBERS                0.009308
OBS_60_CNT_SOCIAL_CIRCLE       0.009022
REG_REGION_NOT_WORK_REGION     0.006942
REG_REGION_NOT_LIVE_REGION     0.005576
FLAG_DOCUMENT_2                0.005417
FLAG_DOCUMENT_21               0.003709
AMT_REQ_CREDIT_BUREAU_DAY      0.002704
FLAG_EMAIL                    -0.001758
AMT_REQ_CREDIT_BUREAU_QRT     -0.002022
SK_ID_CURR                    -0.002108
FLAG_DOCUMENT_4               -0.002672
FLAG_DOCUMENT_17              -0.003378
AMT_INCOME_TOTAL              -0.003982
FLAG_DOCUMENT_11              -0.004229
FLAG_DOCUMENT_9               -0.004352
FLAG_DOCUMENT_15              -0.006536
FLAG_DOCUMENT_18              -0.007952
FLAG_DOCUMENT_8               -0.008040
FLAG_DOCUMENT_14              -0.009464
YEARS_BEGINEXPLUATATION_AVG   -0.009728
LANDAREA_AVG                  -0.010885
FLAG_DOCUMENT_13              -0.011583
FLAG_DOCUMENT_16              -0.011615
AMT_REQ_CREDIT_BUREAU_MON     -0.012462
AMT_ANNUITY                   -0.012817
ENTRANCES_AVG                 -0.019172
BASEMENTAREA_AVG              -0.022746
FLAG_PHONE                    -0.023806
HOUR_APPR_PROCESS_START       -0.024166
FLAG_DOCUMENT_6               -0.028602
APARTMENTS_AVG                -0.029498
AMT_CREDIT                    -0.030369
REGION_POPULATION_RELATIVE    -0.037227
FLOORSMAX_AVG                 -0.044003
DAYS_EMPLOYED                 -0.044932
EXT_SOURCE_1                  -0.155317
EXT_SOURCE_2                  -0.160472
EXT_SOURCE_3                  -0.178919
dtype: float64
datasets['application_train'] = datasets['application_train'].drop(['REG_REGION_NOT_WORK_REGION',
'REG_REGION_NOT_LIVE_REGION',
'FLAG_DOCUMENT_2',
'FLAG_DOCUMENT_21',
'AMT_REQ_CREDIT_BUREAU_DAY',
'FLAG_EMAIL',
'AMT_REQ_CREDIT_BUREAU_QRT',
'FLAG_DOCUMENT_4',
'FLAG_DOCUMENT_17',
'FLAG_DOCUMENT_11',
'FLAG_DOCUMENT_9',
'FLAG_DOCUMENT_15',
'FLAG_DOCUMENT_18',
'FLAG_DOCUMENT_8'],axis=1)
df = pd.DataFrame(datasets['application_train'].corrwith(datasets['application_train']["TARGET"]).sort_values(ascending = False))
df1 = datasets['application_train'][['FLOORSMAX_AVG','DAYS_EMPLOYED','EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3','TARGET']].copy()
df1 = df1.dropna()  # avoid SettingWithCopyWarning from an inplace dropna on a slice
df1.head()
| FLOORSMAX_AVG | DAYS_EMPLOYED | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | TARGET | |
|---|---|---|---|---|---|---|
| 0 | 0.0833 | -637 | 0.083037 | 0.262949 | 0.139376 | 1 |
| 12 | 0.1667 | -2717 | 0.464831 | 0.715042 | 0.176653 | 0 |
| 25 | 0.1667 | -3494 | 0.561948 | 0.651406 | 0.461482 | 0 |
| 50 | 0.4083 | -1176 | 0.656225 | 0.450850 | 0.479449 | 0 |
| 51 | 0.4583 | -6977 | 0.311510 | 0.713355 | 0.309275 | 0 |
sns.pairplot(df1, hue='TARGET', palette='Purples')  # palette only takes effect when hue is set
plt.show()
target_corr2 = pd.DataFrame(datasets['application_train'].corrwith(datasets['application_train']["TARGET"]).sort_values(ascending = False)).reset_index()
target_corr2 = target_corr2.rename(columns={'index':'Column_name', 0:'corr_with_target'})
plt.figure(figsize=(10,10))
sns.barplot(data=target_corr2,x=target_corr2['Column_name'][1:], y= target_corr2['corr_with_target'][1:], palette = 'BuPu')
plt.xticks(rotation=90)
plt.title('Correlation of Features with Target Variable', fontweight = 'bold', fontsize = 16)
plt.show()
After dropping the unimportant or unrelated columns, we plot the correlations with the TARGET variable again. This looks much better than before: the remaining variables are all noticeably correlated with TARGET!
datasets['application_train'].info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Data columns (total 52 columns):
 #   Column                       Non-Null Count   Dtype
---  ------                       --------------   -----
 0   SK_ID_CURR                   307511 non-null  int64
 1   TARGET                       307511 non-null  int64
 2   NAME_CONTRACT_TYPE           307511 non-null  object
 3   CODE_GENDER                  307511 non-null  object
 4   FLAG_OWN_CAR                 307511 non-null  object
 5   FLAG_OWN_REALTY              307511 non-null  object
 6   AMT_INCOME_TOTAL             307511 non-null  float64
 7   AMT_CREDIT                   307511 non-null  float64
 8   AMT_ANNUITY                  307499 non-null  float64
 9   NAME_TYPE_SUITE              306219 non-null  object
 10  NAME_INCOME_TYPE             307511 non-null  object
 11  NAME_EDUCATION_TYPE          307511 non-null  object
 12  NAME_FAMILY_STATUS           307511 non-null  object
 13  NAME_HOUSING_TYPE            307511 non-null  object
 14  REGION_POPULATION_RELATIVE   307511 non-null  float64
 15  DAYS_BIRTH                   307511 non-null  int64
 16  DAYS_EMPLOYED                307511 non-null  int64
 17  DAYS_REGISTRATION            307511 non-null  float64
 18  DAYS_ID_PUBLISH              307511 non-null  int64
 19  OWN_CAR_AGE                  104582 non-null  float64
 20  FLAG_EMP_PHONE               307511 non-null  int64
 21  FLAG_WORK_PHONE              307511 non-null  int64
 22  FLAG_PHONE                   307511 non-null  int64
 23  OCCUPATION_TYPE              211120 non-null  object
 24  CNT_FAM_MEMBERS              307509 non-null  float64
 25  REGION_RATING_CLIENT_W_CITY  307511 non-null  int64
 26  WEEKDAY_APPR_PROCESS_START   307511 non-null  object
 27  HOUR_APPR_PROCESS_START      307511 non-null  int64
 28  REG_CITY_NOT_WORK_CITY       307511 non-null  int64
 29  ORGANIZATION_TYPE            307511 non-null  object
 30  EXT_SOURCE_1                 134133 non-null  float64
 31  EXT_SOURCE_2                 306851 non-null  float64
 32  EXT_SOURCE_3                 246546 non-null  float64
 33  APARTMENTS_AVG               151450 non-null  float64
 34  BASEMENTAREA_AVG             127568 non-null  float64
 35  YEARS_BEGINEXPLUATATION_AVG  157504 non-null  float64
 36  ENTRANCES_AVG                152683 non-null  float64
 37  FLOORSMAX_AVG                154491 non-null  float64
 38  LANDAREA_AVG                 124921 non-null  float64
 39  FONDKAPREMONT_MODE           97216 non-null   object
 40  HOUSETYPE_MODE               153214 non-null  object
 41  WALLSMATERIAL_MODE           151170 non-null  object
 42  EMERGENCYSTATE_MODE          161756 non-null  object
 43  OBS_60_CNT_SOCIAL_CIRCLE     306490 non-null  float64
 44  DEF_60_CNT_SOCIAL_CIRCLE     306490 non-null  float64
 45  FLAG_DOCUMENT_3              307511 non-null  int64
 46  FLAG_DOCUMENT_6              307511 non-null  int64
 47  FLAG_DOCUMENT_13             307511 non-null  int64
 48  FLAG_DOCUMENT_14             307511 non-null  int64
 49  FLAG_DOCUMENT_16             307511 non-null  int64
 50  AMT_REQ_CREDIT_BUREAU_MON    265992 non-null  float64
 51  AMT_REQ_CREDIT_BUREAU_YEAR   265992 non-null  float64
dtypes: float64(20), int64(16), object(16)
memory usage: 122.0+ MB
The info() method prints a concise summary of a DataFrame: the index dtype, each column's dtype, non-null counts, and memory usage.
We can see that the dataset mixes float64, int64, and object (string/categorical) columns.
sns.set_style('whitegrid')
plt.figure(figsize = (10,10))
plt.title("Who borrows the most?", fontweight = 'bold', fontsize = 16)
sns.countplot(x='CODE_GENDER',data=datasets['application_train'], palette = 'Purples')
<AxesSubplot:title={'center':'Who borrows the most?'}, xlabel='CODE_GENDER', ylabel='count'>
There are far more female than male applicants in the dataset. One might conclude that females take out more loans than males, but it would be unsafe to assume that from raw counts alone!
plt.figure(figsize=(10,10))
plt.title("Loan Default by Gender", fontweight = 'bold', fontsize = 16)
sns.countplot(x='CODE_GENDER',data= datasets['application_train'], hue='TARGET', palette="Purples")
<AxesSubplot:title={'center':'Loan Default by Gender'}, xlabel='CODE_GENDER', ylabel='count'>
This plot breaks down default counts by gender. Although female applicants outnumber male applicants in absolute terms, females default at a lower rate than males in percentage terms.
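The percentage claim is easier to see as numbers than as bar heights; a normalized crosstab gives the per-gender default rate directly. A self-contained sketch on toy data (the counts here are invented for illustration, not the HCDR figures):

```python
import pandas as pd

# Toy data: 10 female applicants (1 default), 5 male applicants (2 defaults)
toy = pd.DataFrame({
    "CODE_GENDER": ["F"] * 10 + ["M"] * 5,
    "TARGET":      [1, 0, 0, 0, 0, 0, 0, 0, 0, 0,   1, 1, 0, 0, 0],
})

# normalize="index" -> each row sums to 1, giving per-gender rates
rates = pd.crosstab(toy["CODE_GENDER"], toy["TARGET"], normalize="index")
print(rates)  # F default rate 0.10, M default rate 0.40
```

Swapping `toy` for `datasets['application_train']` yields the actual rates behind the bar chart above.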
plt.figure(figsize=(10,10))
plt.title("Who has more payment difficulties: females or males?", fontweight = 'bold', fontsize= 16)
sns.countplot(x='TARGET',hue='CODE_GENDER',data=datasets['application_train'], palette = 'Purples')
<AxesSubplot:title={'center':'Who has more payment difficulties: females or males?'}, xlabel='TARGET', ylabel='count'>
plt.figure(figsize=(10,10))
plt.title("Loan Default according to the Type of Loans",fontweight='bold' , fontsize =16)
sns.countplot(x='TARGET',hue='NAME_CONTRACT_TYPE',data=datasets['application_train'], palette = 'Purples')
<AxesSubplot:title={'center':'Loan Default according to the Type of Loans'}, xlabel='TARGET', ylabel='count'>
plt.figure(figsize=(10,10))
plt.title("Loan Default with respect to property owned",fontweight='bold' , fontsize =16)
sns.countplot(x='TARGET',hue='FLAG_OWN_REALTY',data=datasets['application_train'], palette = 'Purples')
<AxesSubplot:title={'center':'Loan Default with respect to property owned'}, xlabel='TARGET', ylabel='count'>
plt.figure(figsize=(10,10))
plt.title("Loan Default with respect to type of suite",fontweight='bold' , fontsize =16)
sns.countplot(x='NAME_TYPE_SUITE',hue='TARGET',data=datasets['application_train'], palette = 'Purples')
<AxesSubplot:title={'center':'Loan Default with respect to type of suite'}, xlabel='NAME_TYPE_SUITE', ylabel='count'>
plt.figure(figsize=(10,10))
plt.title("Loan Default with respect to education type",fontweight='bold' , fontsize =16)
sns.countplot(x='NAME_EDUCATION_TYPE',hue='TARGET',data=datasets['application_train'], palette = 'Purples')
<AxesSubplot:title={'center':'Loan Default with respect to education type'}, xlabel='NAME_EDUCATION_TYPE', ylabel='count'>
plt.figure(figsize=(10,10))
plt.title("Loan Default with respect to income type",fontweight='bold' , fontsize =16)
sns.countplot(x='NAME_INCOME_TYPE',hue='TARGET',data=datasets['application_train'], palette = 'Purples')
<AxesSubplot:title={'center':'Loan Default with respect to income type'}, xlabel='NAME_INCOME_TYPE', ylabel='count'>
categorical_df= datasets['application_train'].select_dtypes(include='object')
categorical_df
| NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | OCCUPATION_TYPE | WEEKDAY_APPR_PROCESS_START | ORGANIZATION_TYPE | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cash loans | M | N | Y | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | Laborers | WEDNESDAY | Business Entity Type 3 | reg oper account | block of flats | Stone, brick | No |
| 1 | Cash loans | F | N | N | Family | State servant | Higher education | Married | House / apartment | Core staff | MONDAY | School | reg oper account | block of flats | Block | No |
| 2 | Revolving loans | M | Y | Y | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | Laborers | MONDAY | Government | NaN | NaN | NaN | NaN |
| 3 | Cash loans | F | N | Y | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | Laborers | WEDNESDAY | Business Entity Type 3 | NaN | NaN | NaN | NaN |
| 4 | Cash loans | M | N | Y | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | Core staff | THURSDAY | Religion | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307506 | Cash loans | M | N | N | Unaccompanied | Working | Secondary / secondary special | Separated | With parents | Sales staff | THURSDAY | Services | reg oper account | block of flats | Stone, brick | No |
| 307507 | Cash loans | F | N | Y | Unaccompanied | Pensioner | Secondary / secondary special | Widow | House / apartment | NaN | MONDAY | XNA | reg oper account | block of flats | Stone, brick | No |
| 307508 | Cash loans | F | N | Y | Unaccompanied | Working | Higher education | Separated | House / apartment | Managers | THURSDAY | School | reg oper account | block of flats | Panel | No |
| 307509 | Cash loans | F | N | Y | Unaccompanied | Commercial associate | Secondary / secondary special | Married | House / apartment | Laborers | WEDNESDAY | Business Entity Type 1 | NaN | block of flats | Stone, brick | No |
| 307510 | Cash loans | F | N | N | Unaccompanied | Commercial associate | Higher education | Married | House / apartment | Laborers | THURSDAY | Business Entity Type 3 | NaN | block of flats | Panel | No |
307511 rows × 16 columns
for cat in categorical_df.columns.values[:11]:
plt.figure(figsize=(10,10))
title= "Loan Default with respect to "+ str(cat)
plt.title(title,fontweight='bold' , fontsize =16)
sns.countplot(x=categorical_df[cat],hue='TARGET',data=datasets['application_train'], palette = 'Purples')
plt.xticks(rotation=90)
integer_df1 = datasets['application_train'].select_dtypes(include='int64')
float_df1 = datasets['application_train'].select_dtypes(include='float64')
integer_df1
| SK_ID_CURR | TARGET | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_ID_PUBLISH | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_PHONE | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_CITY_NOT_WORK_CITY | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_16 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | -9461 | -637 | -2120 | 1 | 0 | 1 | 2 | 10 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 100003 | 0 | -16765 | -1188 | -291 | 1 | 0 | 1 | 1 | 11 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 100004 | 0 | -19046 | -225 | -2531 | 1 | 1 | 1 | 2 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 100006 | 0 | -19005 | -3039 | -2437 | 1 | 0 | 0 | 2 | 17 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 100007 | 0 | -19932 | -3038 | -3458 | 1 | 0 | 0 | 2 | 11 | 1 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307506 | 456251 | 0 | -9327 | -236 | -1982 | 1 | 0 | 0 | 1 | 15 | 0 | 0 | 0 | 0 | 0 | 0 |
| 307507 | 456252 | 0 | -20775 | 365243 | -4090 | 0 | 0 | 1 | 2 | 8 | 0 | 1 | 0 | 0 | 0 | 0 |
| 307508 | 456253 | 0 | -14966 | -7921 | -5150 | 1 | 0 | 0 | 3 | 9 | 1 | 1 | 0 | 0 | 0 | 0 |
| 307509 | 456254 | 1 | -11961 | -4786 | -931 | 1 | 0 | 0 | 2 | 9 | 1 | 1 | 0 | 0 | 0 | 0 |
| 307510 | 456255 | 0 | -16856 | -1262 | -410 | 1 | 1 | 1 | 1 | 20 | 1 | 1 | 0 | 0 | 0 | 0 |
307511 rows × 16 columns
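The DAYS_* columns above are recorded relative to the application date, so they are negative (e.g. DAYS_BIRTH = -9461 is about 26 years before application). Note also the value 365243 in DAYS_EMPLOYED for pensioner rows, which is the dataset's sentinel for "no employment record". A minimal sketch of handling both, using a toy frame with values taken from the table above:

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the DAYS_* columns above (values taken from the table).
df = pd.DataFrame({"DAYS_BIRTH": [-9461, -16765, -19046],
                   "DAYS_EMPLOYED": [-637, -1188, 365243]})

# DAYS_BIRTH counts days backwards from the application date, so age in years is:
df["AGE_YEARS"] = (-df["DAYS_BIRTH"] / 365.25).round(1)

# 365243 is the sentinel for "no employment record" (e.g. pensioners);
# treating it as a real value would imply ~1000 years of employment.
df["DAYS_EMPLOYED"] = df["DAYS_EMPLOYED"].replace(365243, np.nan)

print(df["AGE_YEARS"].tolist())  # [25.9, 45.9, 52.1]
```

Replacing the sentinel with NaN lets the imputation step later in the pipeline handle it like any other missing value.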
for integer in integer_df1.columns.values[6:9]:
    plt.figure(figsize=(10, 10))
    title = "Loan Default with respect to " + str(integer)
    plt.title(title, fontweight='bold', fontsize=16)
    sns.countplot(x=integer_df1[integer], hue='TARGET', data=datasets['application_train'], palette='Purples')
    plt.xticks(rotation=90)
fig, ax = plt.subplots(figsize=(14,7))
sns.countplot(x='OCCUPATION_TYPE',data=datasets['application_train'], palette = 'Purples')
plt.xlabel("Occupation Type")
plt.title("Who are the major borrowers? - What are their occupations?", fontweight='bold',fontsize=16)
plt.xticks(rotation=70)
plt.show()
fig, ax = plt.subplots(figsize=(15,7))
sns.barplot(x='OCCUPATION_TYPE',y='AMT_INCOME_TOTAL',hue='TARGET',data=datasets['application_train'],palette='Purples')
plt.xticks(rotation=70)
plt.xlabel("Occupation Type")
plt.ylabel("Average Annual family income")
plt.title("How economically stable are clients? Who are the most and least stable?",fontweight='bold',fontsize=16)
Text(0.5, 1.0, 'How economically stable are clients? Who are the most and least stable?')
datasets['application_train'].select_dtypes('object').apply(pd.Series.nunique, axis = 0)
NAME_CONTRACT_TYPE 2 CODE_GENDER 3 FLAG_OWN_CAR 2 FLAG_OWN_REALTY 2 NAME_TYPE_SUITE 7 NAME_INCOME_TYPE 8 NAME_EDUCATION_TYPE 5 NAME_FAMILY_STATUS 6 NAME_HOUSING_TYPE 6 OCCUPATION_TYPE 18 WEEKDAY_APPR_PROCESS_START 7 ORGANIZATION_TYPE 58 FONDKAPREMONT_MODE 4 HOUSETYPE_MODE 3 WALLSMATERIAL_MODE 7 EMERGENCYSTATE_MODE 2 dtype: int64
fig, ax = plt.subplots(figsize=(15,7))
sns.countplot(x='OCCUPATION_TYPE',hue='TARGET',data=datasets['application_train'],palette="Purples")
plt.xticks(rotation=70)
plt.xlabel("Occupation Type")
plt.title('Which category of occupants repay on time and are better clients for company to lend money?')
Text(0.5, 1.0, 'Which category of occupants repay on time and are better clients for company to lend money?')
Data preprocessing is an integral step in machine learning: the quality of the data, and the useful information that can be derived from it, directly affect our model's ability to learn, so it is essential that we preprocess the data before feeding it into the model. This section covers splitting the data, building separate numerical and categorical pipelines, imputation, scaling, and one-hot encoding.
import numpy as np
import pandas as pd
from time import time

from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in old versions
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (log_loss, accuracy_score, confusion_matrix,
                             roc_auc_score, roc_curve)
We split the data into training and test sets so that we can evaluate the model afterwards on data it has not seen.
from sklearn.model_selection import train_test_split
datasets['application_train'].head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | ... | EMERGENCYSTATE_MODE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_16 | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 202500.0 | 406597.5 | 24700.5 | Unaccompanied | ... | No | 2.0 | 2.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 270000.0 | 1293502.5 | 35698.5 | Family | ... | No | 1.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 67500.0 | 135000.0 | 6750.0 | Unaccompanied | ... | NaN | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 135000.0 | 312682.5 | 29686.5 | Unaccompanied | ... | NaN | 2.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 121500.0 | 513000.0 | 21865.5 | Unaccompanied | ... | NaN | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
5 rows × 52 columns
X = datasets['application_train'].drop(['TARGET'],axis=1)
y = datasets['application_train']['TARGET']
y = pd.DataFrame(y, columns = ["TARGET"])
y
| TARGET | |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 307506 | 0 |
| 307507 | 0 |
| 307508 | 0 |
| 307509 | 1 |
| 307510 | 0 |
307511 rows × 1 columns
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2,random_state=42)
integer_df= X.select_dtypes(include='int64')
float_df = X.select_dtypes(include='float64')
numerical_df = list(pd.concat([integer_df,float_df], axis=1))
categorical_df = list(X.select_dtypes(include='object'))
print(categorical_df)
print(' ')
print(numerical_df)
['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE']
['SK_ID_CURR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_PHONE', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_CITY_NOT_WORK_CITY', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_16', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'LANDAREA_AVG', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_YEAR']
num_pipeline = Pipeline([
('scaler', StandardScaler()),
('imputer', SimpleImputer(strategy = 'median'))
])
cat_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
data_pipeline = ColumnTransformer([
("num_pipeline", num_pipeline, numerical_df),
("cat_pipeline", cat_pipeline, categorical_df)], remainder = 'drop')
X_train_transformed = data_pipeline.fit_transform(X_train)
# Note: in scikit-learn >= 1.0 this encoder method is named get_feature_names_out.
ohe = data_pipeline.transformers_[1][1].named_steps["ohe"]
column_names = numerical_df + list(ohe.get_feature_names(categorical_df))
pd.DataFrame(X_train_transformed, columns=column_names).head()
| SK_ID_CURR | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_ID_PUBLISH | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_PHONE | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_CITY_NOT_WORK_CITY | ... | HOUSETYPE_MODE_terraced house | WALLSMATERIAL_MODE_Block | WALLSMATERIAL_MODE_Mixed | WALLSMATERIAL_MODE_Monolithic | WALLSMATERIAL_MODE_Others | WALLSMATERIAL_MODE_Panel | WALLSMATERIAL_MODE_Stone, brick | WALLSMATERIAL_MODE_Wooden | EMERGENCYSTATE_MODE_No | EMERGENCYSTATE_MODE_Yes | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.341075 | -1.676136 | 2.132885 | -0.937673 | -2.132831 | -0.498325 | -0.624639 | -0.063023 | -0.937369 | -0.546756 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | -1.619000 | 1.173499 | -0.452704 | -0.269054 | 0.468860 | 2.006722 | 1.600924 | -0.063023 | -1.549357 | -0.546756 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | -1.003644 | -1.644952 | -0.467140 | -1.335928 | 0.468860 | -0.498325 | -0.624639 | -0.063023 | 0.286606 | -0.546756 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.912957 | 1.223484 | -0.469780 | 0.160347 | 0.468860 | 2.006722 | 1.600924 | -0.063023 | -0.631375 | 1.828969 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.925258 | 0.055244 | -0.476786 | -1.330627 | 0.468860 | -0.498325 | -0.624639 | -0.063023 | -0.019388 | -0.546756 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
5 rows × 175 columns
Now that we have explored the data, cleaned it, preprocessed it, and added a new feature to it, we can start the modeling part of the project by applying machine learning algorithms. This section builds a baseline logistic regression model and runs grid searches over different models. By the end we will know which parameters work best for each algorithm and can compare each model's performance against the baseline.
from IPython.display import Image
Image(url= "https://prwatech.in/blog/wp-content/uploads/2020/02/logi1.png", width=900, height=900)
Logistic regression (LR) is a statistical method similar to linear regression in that it finds an equation that predicts an outcome, here a binary variable Y, from one or more predictor variables X. Unlike linear regression, however, the predictors may be categorical or continuous, since the model does not strictly require continuous inputs. To predict group membership, LR models the log odds rather than the probabilities directly, and fits the final model with an iterative maximum-likelihood method rather than least squares. This gives the researcher more freedom, and the method may be more appropriate for non-normally distributed data or for samples with unequal covariance matrices. Logistic regression assumes independence among observations, which is not always met in practice.
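The log-odds link described above can be checked in a few lines: the model predicts a score z = w·x + b on the log-odds scale, and the sigmoid (shown in the figure below) maps z to a probability in (0, 1).

```python
import numpy as np

# The sigmoid maps a log-odds score z to a probability in (0, 1).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 2.0])
p = sigmoid(z)
print(p)  # roughly [0.119, 0.5, 0.881]

# Inverting: log(p / (1 - p)) recovers the log-odds z exactly.
assert np.allclose(np.log(p / (1 - p)), z)
```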
from IPython.display import Image
Image(url= "https://www.researchgate.net/publication/327512672/figure/fig2/AS:668286433914884@1536343310243/Basic-sigmoid-function-Note-A-Basic-sigmoid-function-with-two-parameters-c-1-and-c-2.jpg", width=900, height=900)
clf_pipe = make_pipeline(data_pipeline, LogisticRegression())
start = time()
clf_pipe.fit(X_train, y_train)
train_time = np.round(time() - start, 4)
trainAcc = clf_pipe.score(X_train, y_train)
validAcc = clf_pipe.score(X_test, y_test)  # no separate validation split was made, so the test set stands in here
start = time()
testAcc = clf_pipe.score(X_test, y_test)
test_time = np.round(time() - start, 4)
#del experimentLog
try:
    experimentLog
except NameError:
    experimentLog = pd.DataFrame(columns=["Pipeline", "Dataset", "TrainAcc", "ValidAcc", "TestAcc",
                                          "Train Time(s)", "Test Time(s)", "Description"])

experimentLog.loc[len(experimentLog)] = ["Baseline 1 LogReg", "HCDR",
                                         f"{trainAcc*100:8.2f}%", f"{validAcc*100:8.2f}%", f"{testAcc*100:8.2f}%",
                                         train_time, test_time,
                                         "Baseline 1 LogReg pipeline with Cat+Num features"]
experimentLog
| Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | Train Time(s) | Test Time(s) | Description | |
|---|---|---|---|---|---|---|---|---|
| 0 | Baseline 1 LogReg | HCDR | 91.91% | 91.94% | 91.94% | 17.1005 | 0.4038 | Baseline 1 LogReg pipeline with Cat+Num features |
y_pred = clf_pipe.predict(X_test)
res = clf_pipe.predict_proba(X_train)
res
array([[0.95176856, 0.04823144],
[0.8854846 , 0.1145154 ],
[0.95997397, 0.04002603],
...,
[0.96730581, 0.03269419],
[0.94754299, 0.05245701],
[0.87877635, 0.12122365]])
Predicted probabilities for class 0 (repay) and class 1 (default).
print('Test set accuracy score: ' + str(accuracy_score(y_test, y_pred)))
Test set accuracy score: 0.919434824317513
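An accuracy near 92% looks strong, but TARGET is heavily imbalanced (roughly 8% of applicants default), so a classifier that always predicts "repay" scores almost as well. A quick sketch with toy labels at a hypothetical 8% positive rate:

```python
# Toy labels: 8 defaults out of 100, mimicking the TARGET imbalance (hypothetical rate).
y = [1] * 8 + [0] * 92

pos_rate = sum(y) / len(y)
majority_acc = max(pos_rate, 1 - pos_rate)  # "always predict the majority class"
print(majority_acc)  # 0.92
```

This is why the ROC AUC below is a more informative metric than raw accuracy for this dataset.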
Image(url= "https://miro.medium.com/max/1192/1*wilGXrItaMAJmZNl6RJq9Q.png", width=900, height=900)
roc_auc_score(y_test, clf_pipe.predict_proba(X_test)[:, 1])
0.7417386986188286
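The AUC of about 0.74 has a useful pairwise interpretation: it is the probability that a randomly chosen positive (defaulter) receives a higher score than a randomly chosen negative, with ties counted as half. A brute-force toy check of that interpretation (made-up scores and labels):

```python
# AUC as the probability that a random positive outranks a random negative.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]

pos = [s for s, l in zip(scores, labels) if l == 1]
neg = [s for s, l in zip(scores, labels) if l == 0]
wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(auc)  # 0.75 — agrees with sklearn's roc_auc_score on the same inputs
```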
import matplotlib.pyplot as plt
from sklearn import metrics
metrics.plot_roc_curve(clf_pipe, X_test, y_test)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x7f42db3f0af0>
from sklearn.metrics import accuracy_score, recall_score, plot_roc_curve, confusion_matrix, classification_report, precision_recall_curve, auc
cf_matrix = confusion_matrix(y_test, y_pred)
sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True,
fmt='.2%', cmap='Purples')
plt.title('Confusion Matrix of Logistic Regression', fontweight='bold', fontsize=16)
plt.xlabel('Predicted', fontweight='bold', fontsize=12)
plt.ylabel('Actual', fontweight='bold', fontsize=12)
Text(33.0, 0.5, 'Actual')
Random Forest works in two phases: first it builds the forest by combining N decision trees, then it makes a prediction with each tree and aggregates their votes.
The working process can be summarized in the steps and diagrams below:
Step-1: Select K random data points from the training set.
Step-2: Build a decision tree on the selected subset.
Step-3: Choose the number N of decision trees you want to build.
Step-4: Repeat steps 1 and 2 for each of the N trees.
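The aggregation step above reduces to a majority vote across the trees. A minimal sketch, with hypothetical per-tree votes for a single applicant:

```python
from collections import Counter

# Hypothetical votes from N = 7 fitted trees for one applicant:
# each tree predicts 0 (repay) or 1 (default).
tree_votes = [0, 1, 0, 0, 1, 0, 0]

# The forest's classification is the majority vote across trees.
prediction = Counter(tree_votes).most_common(1)[0][0]
print(prediction)  # 0 — five of seven trees vote "repay"
```

scikit-learn's `RandomForestClassifier` (used below) averages per-tree class probabilities rather than counting hard votes, but the intuition is the same.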
Image(url= "https://cdn.corporatefinanceinstitute.com/assets/random-forest.png", width=900, height=900)
Image(url= "https://miro.medium.com/max/1400/1*cBaV_bLVW-gT6PkG5ve26A.png", width=900, height=900)
# def model_rf(X_train):
RF = RandomForestClassifier(random_state = 42,n_estimators=20, criterion='gini', max_depth=6)
data_pipeline_rf = make_pipeline(data_pipeline, RF)
start = time()
data_pipeline_rf.fit(X_train, y_train)
train_time = np.round(time() - start, 4)
trainAcc = data_pipeline_rf.score(X_train, y_train)
validAcc = data_pipeline_rf.score(X_test, y_test)
start = time()
testAcc = data_pipeline_rf.score(X_test, y_test)
test_time = np.round(time() - start, 4)
predictions = data_pipeline_rf.predict_proba(X_test)
print ("ROC_AUC_Score",roc_auc_score(y_test, predictions[:,1]))
fpr, tpr, _ = roc_curve(y_test, predictions[:,1])
plt.clf()
plt.plot(fpr, tpr)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve',fontweight='bold',fontsize=20)
plt.show()
try:
    experimentLog
except NameError:
    experimentLog = pd.DataFrame(columns=["Pipeline", "Dataset", "TrainAcc", "ValidAcc", "TestAcc",
                                          "Train Time(s)", "Test Time(s)", "Description"])

experimentLog.loc[len(experimentLog)] = ["Baseline 1 RandomForest", "HCDR",
                                         f"{trainAcc*100:8.2f}%", f"{validAcc*100:8.2f}%", f"{testAcc*100:8.2f}%",
                                         train_time, test_time,
                                         "Baseline 1 RandomForest pipeline with Cat+Num features"]
experimentLog
ROC_AUC_Score 0.7253912369656725
| Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | Train Time(s) | Test Time(s) | Description | |
|---|---|---|---|---|---|---|---|---|
| 0 | Baseline 1 LogReg | HCDR | 91.91% | 91.94% | 91.94% | 17.1005 | 0.4038 | Baseline 1 LogReg pipeline with Cat+Num features |
| 1 | Baseline 1 RandomForest | HCDR | 91.91% | 91.95% | 91.95% | 10.7311 | 0.4899 | Baseline 1 RandomForest pipeline with Cat+Num ... |
y_pred_rf = data_pipeline_rf.predict(X_test)
res = data_pipeline_rf.predict_proba(X_train)
res
array([[0.94402268, 0.05597732],
[0.88961809, 0.11038191],
[0.93682042, 0.06317958],
...,
[0.95657847, 0.04342153],
[0.91790688, 0.08209312],
[0.89067702, 0.10932298]])
print('Test set accuracy score: ' + str(accuracy_score(y_test, y_pred_rf)))
Test set accuracy score: 0.9195323805342829
log_loss(y_test, y_pred_rf)  # note: fed hard 0/1 predictions here rather than probabilities
2.7792531157571663
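Log loss is defined on predicted probabilities; passing hard 0/1 labels, as above, charges every misclassified sample the maximum clipped penalty, which is why the value (~2.78) is so large. A toy comparison with a made-up probability:

```python
import math

# Binary cross-entropy for one sample: -[y*log(p) + (1-y)*log(1-p)].
def bce(y, p, eps=1e-15):
    p = min(max(p, eps), 1 - eps)  # clip probabilities, as sklearn's log_loss does
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# True label 1: a confident correct probability costs little,
# while a hard wrong prediction of 0 costs the clipped maximum.
print(round(bce(1, 0.9), 4))  # 0.1054
print(round(bce(1, 0.0), 1))  # 34.5
```

Feeding `data_pipeline_rf.predict_proba(X_test)[:, 1]` to `log_loss` instead would give the probability-based score.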
X_train
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | ... | EMERGENCYSTATE_MODE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_16 | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 123473 | 243191 | Cash loans | F | Y | N | 171000.0 | 555273.0 | 16366.5 | Unaccompanied | Pensioner | ... | NaN | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 1.0 |
| 10118 | 111778 | Cash loans | M | N | Y | 157500.0 | 198085.5 | 23638.5 | Unaccompanied | Working | ... | No | 1.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 1.0 |
| 64716 | 175057 | Cash loans | M | Y | Y | 135000.0 | 776304.0 | 25173.0 | Unaccompanied | Working | ... | NaN | 2.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 5.0 |
| 234940 | 372147 | Cash loans | M | Y | Y | 164133.0 | 900000.0 | 36787.5 | Unaccompanied | Commercial associate | ... | No | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 3.0 |
| 236051 | 373412 | Cash loans | M | N | Y | 225000.0 | 533668.5 | 21294.0 | Spouse, partner | Commercial associate | ... | NaN | 6.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 4.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 119879 | 239009 | Cash loans | F | N | Y | 121500.0 | 675000.0 | 32602.5 | Unaccompanied | Working | ... | NaN | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | NaN | NaN |
| 259178 | 399937 | Cash loans | F | N | N | 225000.0 | 509922.0 | 40419.0 | Spouse, partner | Commercial associate | ... | NaN | 5.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 4.0 |
| 131932 | 253016 | Cash loans | M | Y | Y | 450000.0 | 1125000.0 | 33025.5 | Unaccompanied | Working | ... | No | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 2.0 | 3.0 |
| 146867 | 270275 | Revolving loans | M | Y | Y | 180000.0 | 270000.0 | 13500.0 | Unaccompanied | Working | ... | No | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 1.0 |
| 121958 | 241394 | Cash loans | M | Y | Y | 157500.0 | 729792.0 | 35239.5 | Family | Working | ... | NaN | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 0.0 |
246008 rows × 51 columns
cf_matrix = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True,
            fmt='.2%', cmap='Purples')
plt.title('Confusion Matrix of Random Forest Classifier', fontweight='bold', fontsize=16)
plt.xlabel('Predicted', fontweight='bold', fontsize=12)
plt.ylabel('Actual', fontweight='bold', fontsize=12)
Text(33.0, 0.5, 'Actual')
datasets['bureau'].head()
| SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
datasets['bureau']['CREDIT_ACTIVE'].value_counts()
Closed 1079273 Active 630607 Sold 6527 Bad debt 21 Name: CREDIT_ACTIVE, dtype: int64
Joining bureau with bureau_balance on SK_ID_BUREAU:
df_bureau1 = pd.merge(left = datasets['bureau'], right = datasets['bureau_balance'], how='inner', left_on='SK_ID_BUREAU', right_on='SK_ID_BUREAU')
df_bureau1.shape
(24179741, 19)
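The row count explodes to ~24M because the inner join repeats each bureau loan once per monthly record in bureau_balance. A toy illustration of the mechanics, with hypothetical IDs and values:

```python
import pandas as pd

# Toy versions of bureau (one row per loan) and bureau_balance (one row per month).
bureau = pd.DataFrame({"SK_ID_BUREAU": [1, 2], "AMT_CREDIT_SUM": [100.0, 200.0]})
balance = pd.DataFrame({"SK_ID_BUREAU": [1, 1, 1, 2],
                        "MONTHS_BALANCE": [0, -1, -2, 0]})

# Inner join: loan 1 is repeated for each of its three monthly rows, loan 2 once.
merged = pd.merge(bureau, balance, how="inner", on="SK_ID_BUREAU")
print(merged.shape)  # (4, 3)
```

This duplication is why the aggregation back to one row per applicant (the `groupby` below) is needed before joining with application_train.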
datasets['application_train'].head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | ... | EMERGENCYSTATE_MODE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_16 | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 202500.0 | 406597.5 | 24700.5 | Unaccompanied | ... | No | 2.0 | 2.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 270000.0 | 1293502.5 | 35698.5 | Family | ... | No | 1.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 67500.0 | 135000.0 | 6750.0 | Unaccompanied | ... | NaN | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 135000.0 | 312682.5 | 29686.5 | Unaccompanied | ... | NaN | 2.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 121500.0 | 513000.0 | 21865.5 | Unaccompanied | ... | NaN | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
5 rows × 52 columns
We make a small helper dataframe containing only SK_ID_CURR and TARGET so that we can check each merged bureau column's correlation with the target and drop the irrelevant columns.
df_test = datasets['application_train'][['SK_ID_CURR','TARGET']]
df_test
| SK_ID_CURR | TARGET | |
|---|---|---|
| 0 | 100002 | 1 |
| 1 | 100003 | 0 |
| 2 | 100004 | 0 |
| 3 | 100006 | 0 |
| 4 | 100007 | 0 |
| ... | ... | ... |
| 307506 | 456251 | 0 |
| 307507 | 456252 | 0 |
| 307508 | 456253 | 0 |
| 307509 | 456254 | 1 |
| 307510 | 456255 | 0 |
307511 rows × 2 columns
df_mergewithtarget = pd.merge(left = df_bureau1, right = df_test,
how='left', left_on='SK_ID_CURR', right_on='SK_ID_CURR')
df_mergewithtarget.shape
(24179741, 20)
!pip install missingno
Requirement already satisfied: missingno in /geode2/home/u060/nmalpani/Carbonate/miniconda3/lib/python3.9/site-packages (0.5.0)
import missingno as msno
df_mergewithtarget1 = df_mergewithtarget.dropna(subset=['TARGET'])
msno.bar(df_mergewithtarget1)
<AxesSubplot:>
plt.figure(figsize = (20,20))
sns.heatmap(df_mergewithtarget1.corr(),annot=True, cmap='BuPu')
<AxesSubplot:>
df_mergewithtarget1.corrwith(df_mergewithtarget1["TARGET"]).sort_values(ascending = False)
TARGET 1.000000 DAYS_CREDIT 0.047350 DAYS_CREDIT_UPDATE 0.033169 DAYS_ENDDATE_FACT 0.031562 MONTHS_BALANCE 0.027773 DAYS_CREDIT_ENDDATE 0.014881 AMT_CREDIT_SUM_OVERDUE 0.004875 AMT_CREDIT_SUM_DEBT 0.001840 CNT_CREDIT_PROLONG 0.000929 SK_ID_CURR 0.000909 CREDIT_DAY_OVERDUE -0.000128 AMT_CREDIT_MAX_OVERDUE -0.000204 AMT_ANNUITY -0.000389 AMT_CREDIT_SUM_LIMIT -0.006195 SK_ID_BUREAU -0.007517 AMT_CREDIT_SUM -0.010691 dtype: float64
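The bureau columns kept below (DAYS_CREDIT, DAYS_ENDDATE_FACT, AMT_CREDIT_SUM, DAYS_CREDIT_UPDATE, MONTHS_BALANCE, AMT_CREDIT_SUM_LIMIT) are those with relatively large absolute correlation with TARGET. The same selection can be scripted with a threshold instead of picked by eye; a sketch on a toy correlation series, with a hypothetical 0.01 cutoff:

```python
import pandas as pd

# Toy correlation-with-TARGET series mirroring a few values from the output above.
corr = pd.Series({"DAYS_CREDIT": 0.047, "DAYS_CREDIT_UPDATE": 0.033,
                  "AMT_CREDIT_SUM": -0.011, "AMT_ANNUITY": -0.0004})

# Keep features whose |correlation| clears the (hypothetical) cutoff.
keep = corr[corr.abs() >= 0.01].index.tolist()
print(keep)  # ['DAYS_CREDIT', 'DAYS_CREDIT_UPDATE', 'AMT_CREDIT_SUM']
```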
df_bureau1 = df_bureau1.drop_duplicates()
df_bureau1['CREDIT_ACTIVE'].value_counts()
Closed 18511853 Active 5545692 Sold 122096 Bad debt 100 Name: CREDIT_ACTIVE, dtype: int64
df_bureau1['AMT_CREDIT_SUM_LIMIT'].isnull().sum()
10376144
df_bureau1['AMT_CREDIT_SUM_LIMIT'] = df_bureau1['AMT_CREDIT_SUM_LIMIT'].replace(np.nan,0)
df_bureau1
| SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | MONTHS_BALANCE | STATUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 380361 | 5715448 | Active | currency 1 | -820 | 0 | 31069.0 | NaN | NaN | 0 | 67500.0 | 0.0 | 67500.0 | 0.0 | Credit card | -183 | 0.0 | 0 | C |
| 1 | 380361 | 5715448 | Active | currency 1 | -820 | 0 | 31069.0 | NaN | NaN | 0 | 67500.0 | 0.0 | 67500.0 | 0.0 | Credit card | -183 | 0.0 | -1 | C |
| 2 | 380361 | 5715448 | Active | currency 1 | -820 | 0 | 31069.0 | NaN | NaN | 0 | 67500.0 | 0.0 | 67500.0 | 0.0 | Credit card | -183 | 0.0 | -2 | C |
| 3 | 380361 | 5715448 | Active | currency 1 | -820 | 0 | 31069.0 | NaN | NaN | 0 | 67500.0 | 0.0 | 67500.0 | 0.0 | Credit card | -183 | 0.0 | -3 | C |
| 4 | 380361 | 5715448 | Active | currency 1 | -820 | 0 | 31069.0 | NaN | NaN | 0 | 67500.0 | 0.0 | 67500.0 | 0.0 | Credit card | -183 | 0.0 | -4 | C |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24179736 | 407724 | 5053758 | Closed | currency 1 | -2423 | 0 | -2150.0 | -2175.0 | NaN | 0 | 21883.5 | 0.0 | 0.0 | 0.0 | Consumer credit | -787 | NaN | -75 | X |
| 24179737 | 407724 | 5053758 | Closed | currency 1 | -2423 | 0 | -2150.0 | -2175.0 | NaN | 0 | 21883.5 | 0.0 | 0.0 | 0.0 | Consumer credit | -787 | NaN | -76 | X |
| 24179738 | 407724 | 5053758 | Closed | currency 1 | -2423 | 0 | -2150.0 | -2175.0 | NaN | 0 | 21883.5 | 0.0 | 0.0 | 0.0 | Consumer credit | -787 | NaN | -77 | X |
| 24179739 | 407724 | 5053758 | Closed | currency 1 | -2423 | 0 | -2150.0 | -2175.0 | NaN | 0 | 21883.5 | 0.0 | 0.0 | 0.0 | Consumer credit | -787 | NaN | -78 | X |
| 24179740 | 407724 | 5053758 | Closed | currency 1 | -2423 | 0 | -2150.0 | -2175.0 | NaN | 0 | 21883.5 | 0.0 | 0.0 | 0.0 | Consumer credit | -787 | NaN | -79 | X |
24179741 rows × 19 columns
df_bureau1['AMT_CREDIT_SUM_LIMIT'].value_counts()
0.000 23537973
135000.000 14315
13500.000 12765
45000.000 11447
90000.000 8734
...
62694.360 1
8651.295 1
21375.000 1
14805.000 1
2220.030 1
Name: AMT_CREDIT_SUM_LIMIT, Length: 22040, dtype: int64
msno.bar(df_bureau1)
<AxesSubplot:>
df_bureau1 = df_bureau1.groupby('SK_ID_CURR').median()
df_bureau1
| SK_ID_BUREAU | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | MONTHS_BALANCE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | ||||||||||||||
| 100001 | 5896632.0 | -879.0 | 0.0 | -492.0 | -877.0 | NaN | 0.0 | 112500.0 | 0.000 | 0.0 | 0.0 | -155.0 | 0.0 | -14.0 |
| 100002 | 6158905.0 | -1043.0 | 0.0 | -911.0 | -967.0 | 0.0 | 0.0 | 40761.0 | 0.000 | 0.0 | 0.0 | -758.0 | 0.0 | -26.0 |
| 100005 | 6735200.0 | -373.0 | 0.0 | -128.0 | -123.0 | 0.0 | 0.0 | 58500.0 | 0.000 | 0.0 | 0.0 | -121.0 | 0.0 | -3.0 |
| 100010 | 5576631.0 | -1939.5 | 0.0 | -119.5 | -1138.0 | NaN | 0.0 | 495000.0 | 174003.750 | 0.0 | 0.0 | -578.0 | NaN | -46.0 |
| 100013 | 5922082.0 | -2005.0 | 0.0 | -909.0 | -1218.0 | 19305.0 | 0.0 | 688500.0 | 0.000 | 0.0 | 0.0 | -1215.0 | 0.0 | -28.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 456247 | 6618241.0 | -1757.0 | 0.0 | -1402.0 | -1402.0 | 2898.0 | 0.0 | 143667.0 | 0.000 | 0.0 | 0.0 | -1402.0 | 4329.0 | -22.0 |
| 456250 | 6817237.0 | -824.0 | 0.0 | 1797.0 | -760.0 | 0.0 | 0.0 | 483349.5 | 391731.615 | 0.0 | 0.0 | -31.0 | 51799.5 | -14.0 |
| 456253 | 6098498.0 | -919.0 | 0.0 | -189.0 | -794.0 | NaN | 0.0 | 675000.0 | 0.000 | 0.0 | 0.0 | -179.0 | 58369.5 | -14.0 |
| 456254 | 6669849.0 | -1104.0 | 0.0 | -859.0 | -859.0 | NaN | 0.0 | 45000.0 | 0.000 | 0.0 | 0.0 | -401.0 | 0.0 | -18.0 |
| 456255 | 5126331.0 | -1021.0 | 0.0 | 279.0 | -958.0 | 16618.5 | 0.0 | 436032.0 | 0.000 | 0.0 | 0.0 | -779.0 | 0.0 | -21.0 |
134542 rows × 14 columns
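The `groupby('SK_ID_CURR').median()` above collapses the ~24M merged bureau rows to one row per applicant (134,542 here). A toy check of what the aggregation does:

```python
import pandas as pd

# Two applicants with two bureau records each (hypothetical values).
df = pd.DataFrame({"SK_ID_CURR": [1, 1, 2, 2],
                   "DAYS_CREDIT": [-100, -300, -50, -150]})

# One row per applicant; each numeric column becomes its per-applicant median.
agg = df.groupby("SK_ID_CURR").median()
print(agg["DAYS_CREDIT"].tolist())  # [-200.0, -100.0]
```

Median aggregation is robust to outliers, but it discards the non-numeric columns (CREDIT_ACTIVE, CREDIT_TYPE, STATUS), which is why they are absent from the aggregated frame above.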
df_bureau1.reset_index(inplace=True)
df_bureau_final = df_bureau1[['SK_ID_CURR','DAYS_CREDIT','DAYS_ENDDATE_FACT','AMT_CREDIT_SUM','DAYS_CREDIT_UPDATE',
'MONTHS_BALANCE',
'AMT_CREDIT_SUM_LIMIT']]
Joining on SK_ID_CURR
df_finalm_bureau_train = pd.merge(left = df_bureau_final, right = datasets['application_train'],
how='inner', left_on='SK_ID_CURR', right_on='SK_ID_CURR')
df_finalm_bureau_train
| SK_ID_CURR | DAYS_CREDIT | DAYS_ENDDATE_FACT | AMT_CREDIT_SUM | DAYS_CREDIT_UPDATE | MONTHS_BALANCE | AMT_CREDIT_SUM_LIMIT | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | ... | EMERGENCYSTATE_MODE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_16 | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | -1043.0 | -967.0 | 40761.0 | -758.0 | -26.0 | 0.0 | 1 | Cash loans | M | ... | No | 2.0 | 2.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 1.0 |
| 1 | 100010 | -1939.5 | -1138.0 | 495000.0 | -578.0 | -46.0 | 0.0 | 0 | Cash loans | M | ... | NaN | 2.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 2 | 100019 | -495.0 | NaN | 360000.0 | -26.5 | -8.0 | 135000.0 | 0 | Cash loans | M | ... | NaN | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 1.0 |
| 3 | 100032 | -1169.5 | -662.0 | 331875.0 | -520.5 | -28.0 | 0.0 | 0 | Cash loans | M | ... | NaN | 2.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 4 | 100033 | -195.0 | NaN | 675000.0 | -169.0 | -3.0 | 0.0 | 0 | Cash loans | M | ... | NaN | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 1.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 92226 | 456244 | -2138.0 | -1438.0 | 315000.0 | -1381.0 | -33.5 | 0.0 | 0 | Cash loans | F | ... | No | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 5.0 |
| 92227 | 456247 | -1757.0 | -1402.0 | 143667.0 | -1402.0 | -22.0 | 0.0 | 0 | Cash loans | F | ... | NaN | 9.0 | 2.0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 2.0 |
| 92228 | 456253 | -919.0 | -794.0 | 675000.0 | -179.0 | -14.0 | 0.0 | 0 | Cash loans | F | ... | No | 6.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 1.0 | 1.0 |
| 92229 | 456254 | -1104.0 | -859.0 | 45000.0 | -401.0 | -18.0 | 0.0 | 1 | Cash loans | F | ... | No | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 92230 | 456255 | -1021.0 | -958.0 | 436032.0 | -779.0 | -21.0 | 0.0 | 0 | Cash loans | F | ... | No | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 2.0 | 1.0 |
92231 rows × 58 columns
msno.bar(df_finalm_bureau_train)
<AxesSubplot:>
Storing the data to a CSV file so that, if the kernel crashes, we can resume from this checkpoint and save time.
df_finalm_bureau_train.to_csv('bureau_files_merged.csv',index=False)
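After a kernel restart, the checkpoint can be reloaded instead of re-running the merges. A minimal round-trip sketch (the demo frame and filename are illustrative, not the notebook's actual data):

```python
import pandas as pd

# Persist a small demo checkpoint, then reload it as one would after a restart.
df = pd.DataFrame({'SK_ID_CURR': [100002, 100010], 'TARGET': [1, 0]})
df.to_csv('bureau_files_merged_demo.csv', index=False)  # hypothetical demo filename

restored = pd.read_csv('bureau_files_merged_demo.csv')
print(restored.shape)  # (2, 2)
```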
datasets['POS_CASH_balance'].head()
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | CNT_INSTALMENT | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1803195 | 182943 | -31 | 48.0 | 45.0 | Active | 0 | 0 |
| 1 | 1715348 | 367990 | -33 | 36.0 | 35.0 | Active | 0 | 0 |
| 2 | 1784872 | 397406 | -32 | 12.0 | 9.0 | Active | 0 | 0 |
| 3 | 1903291 | 269225 | -35 | 48.0 | 42.0 | Active | 0 | 0 |
| 4 | 2341044 | 334279 | -35 | 36.0 | 35.0 | Active | 0 | 0 |
datasets['POS_CASH_balance'].shape
(10001358, 8)
We remove irrelevant columns using the correlation criterion and domain knowledge.
datasets['POS_CASH_balance']['NAME_CONTRACT_STATUS'].value_counts()
Active                   9151119
Completed                 744883
Signed                     87260
Demand                      7065
Returned to the store       5461
Approved                    4917
Amortized debt               636
Canceled                      15
XNA                            2
Name: NAME_CONTRACT_STATUS, dtype: int64
df_POS = datasets['POS_CASH_balance'][['SK_ID_PREV','SK_ID_CURR','CNT_INSTALMENT','CNT_INSTALMENT_FUTURE','NAME_CONTRACT_STATUS']]
plt.figure(figsize = (20,20))
sns.heatmap(df_POS.corr(),annot=True, cmap='BuPu')
<AxesSubplot:>
df_POS = df_POS.drop(['CNT_INSTALMENT'],axis=1)
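Rather than reading the drop candidates off the heatmap by eye, highly correlated columns can be flagged programmatically. A sketch on toy data; the 0.9 threshold is an assumption, not a value used in this notebook:

```python
import numpy as np
import pandas as pd

def high_corr_columns(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

demo = pd.DataFrame({'a': [1, 2, 3, 4],
                     'b': [2, 4, 6, 8],    # exact multiple of 'a'
                     'c': [4, 1, 3, 2]})
print(high_corr_columns(demo))  # ['b']
```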
df_POS_atrain = pd.merge(left=datasets['application_train'], right=df_POS, how='inner',
                         on='SK_ID_CURR')
df_POS_atrain.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | ... | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_16 | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_YEAR | SK_ID_PREV | CNT_INSTALMENT_FUTURE | NAME_CONTRACT_STATUS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 202500.0 | 406597.5 | 24700.5 | Unaccompanied | ... | 1 | 0 | 0 | 0 | 0 | 0.0 | 1.0 | 1038818 | 18.0 | Active |
| 1 | 100002 | 1 | Cash loans | M | N | Y | 202500.0 | 406597.5 | 24700.5 | Unaccompanied | ... | 1 | 0 | 0 | 0 | 0 | 0.0 | 1.0 | 1038818 | 21.0 | Active |
| 2 | 100002 | 1 | Cash loans | M | N | Y | 202500.0 | 406597.5 | 24700.5 | Unaccompanied | ... | 1 | 0 | 0 | 0 | 0 | 0.0 | 1.0 | 1038818 | 23.0 | Active |
| 3 | 100002 | 1 | Cash loans | M | N | Y | 202500.0 | 406597.5 | 24700.5 | Unaccompanied | ... | 1 | 0 | 0 | 0 | 0 | 0.0 | 1.0 | 1038818 | 22.0 | Active |
| 4 | 100002 | 1 | Cash loans | M | N | Y | 202500.0 | 406597.5 | 24700.5 | Unaccompanied | ... | 1 | 0 | 0 | 0 | 0 | 0.0 | 1.0 | 1038818 | 10.0 | Active |
5 rows × 55 columns
df_POS_atrain.shape
(8543375, 55)
msno.bar(df_POS_atrain)
<AxesSubplot:>
datasets['credit_card_balance'].head()
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | ... | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | ... | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | ... | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | ... | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | ... | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | ... | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
5 rows × 23 columns
Dropping columns with little or no significance with respect to TARGET.
df_ccbal = datasets['credit_card_balance'].drop(['SK_DPD','SK_DPD_DEF','MONTHS_BALANCE','NAME_CONTRACT_STATUS'],axis=1)
df_ccbal.head()
| SK_ID_PREV | SK_ID_CURR | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | AMT_PAYMENT_CURRENT | AMT_PAYMENT_TOTAL_CURRENT | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | 1800.0 | 1800.0 | 0.000 | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 |
| 1 | 2582071 | 363914 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | 2250.0 | 2250.0 | 60175.080 | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 |
| 2 | 1740877 | 371185 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | 2250.0 | 2250.0 | 26926.425 | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 |
| 3 | 1389973 | 337855 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | 11925.0 | 11925.0 | 224949.285 | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 |
| 4 | 1891521 | 126868 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | 27000.0 | 27000.0 | 443044.395 | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 |
plt.figure(figsize = (20,20))
sns.heatmap(df_ccbal.corr(),annot=True, cmap='BuPu')
<AxesSubplot:>
df_ccbal = df_ccbal.drop(['AMT_INST_MIN_REGULARITY','AMT_RECEIVABLE_PRINCIPAL','AMT_RECIVABLE','AMT_TOTAL_RECEIVABLE',
'AMT_PAYMENT_TOTAL_CURRENT','AMT_DRAWINGS_ATM_CURRENT','CNT_DRAWINGS_CURRENT'],axis=1)
df_ccbal.head()
| SK_ID_PREV | SK_ID_CURR | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_PAYMENT_CURRENT | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | 56.970 | 135000 | 877.5 | 0.0 | 877.5 | 1800.0 | 0.0 | 0.0 | 1.0 | 35.0 |
| 1 | 2582071 | 363914 | 63975.555 | 45000 | 2250.0 | 0.0 | 0.0 | 2250.0 | 1.0 | 0.0 | 0.0 | 69.0 |
| 2 | 1740877 | 371185 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 2250.0 | 0.0 | 0.0 | 0.0 | 30.0 |
| 3 | 1389973 | 337855 | 236572.110 | 225000 | 2250.0 | 0.0 | 0.0 | 11925.0 | 1.0 | 0.0 | 0.0 | 10.0 |
| 4 | 1891521 | 126868 | 453919.455 | 450000 | 11547.0 | 0.0 | 11547.0 | 27000.0 | 0.0 | 0.0 | 1.0 | 101.0 |
Grouping by SK_ID_CURR and SK_ID_PREV and aggregating by mean, so that we retain as much information as possible when merging or joining.
df_ccbal1=df_ccbal.groupby(['SK_ID_CURR','SK_ID_PREV']).mean()
df_ccbal1 = df_ccbal1.reset_index()
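Aggregating by mean keeps one row per (SK_ID_CURR, SK_ID_PREV) pair. If more signal is wanted, several statistics can be kept per group instead. A sketch on toy data with an illustrative balance column:

```python
import pandas as pd

toy = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 2],
    'SK_ID_PREV': [10, 10, 11, 12],
    'AMT_BALANCE': [100.0, 300.0, 50.0, 80.0],
})

# One row per (SK_ID_CURR, SK_ID_PREV); mean, min and max of the balance.
agg = (toy.groupby(['SK_ID_CURR', 'SK_ID_PREV'])['AMT_BALANCE']
          .agg(['mean', 'min', 'max'])
          .reset_index())
print(agg)
```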
plt.figure(figsize = (20,20))
sns.heatmap(df_ccbal1.corr(),annot=True, cmap='BuPu')
<AxesSubplot:>
msno.bar(df_ccbal1)
<AxesSubplot:>
We will process installments_payments first, join it with credit_card_balance, and then join the result with application_train.
datasets['installments_payments'].head()
| SK_ID_PREV | SK_ID_CURR | NUM_INSTALMENT_VERSION | NUM_INSTALMENT_NUMBER | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1054186 | 161674 | 1.0 | 6 | -1180.0 | -1187.0 | 6948.360 | 6948.360 |
| 1 | 1330831 | 151639 | 0.0 | 34 | -2156.0 | -2156.0 | 1716.525 | 1716.525 |
| 2 | 2085231 | 193053 | 2.0 | 1 | -63.0 | -63.0 | 25425.000 | 25425.000 |
| 3 | 2452527 | 199697 | 1.0 | 3 | -2418.0 | -2426.0 | 24350.130 | 24350.130 |
| 4 | 2714724 | 167756 | 1.0 | 2 | -1383.0 | -1366.0 | 2165.040 | 2160.585 |
df_installment = datasets['installments_payments'].drop(['NUM_INSTALMENT_VERSION','NUM_INSTALMENT_NUMBER'],axis=1)
msno.bar(df_installment)
<AxesSubplot:>
df_installment.head()
| SK_ID_PREV | SK_ID_CURR | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
|---|---|---|---|---|---|---|
| 0 | 1054186 | 161674 | -1180.0 | -1187.0 | 6948.360 | 6948.360 |
| 1 | 1330831 | 151639 | -2156.0 | -2156.0 | 1716.525 | 1716.525 |
| 2 | 2085231 | 193053 | -63.0 | -63.0 | 25425.000 | 25425.000 |
| 3 | 2452527 | 199697 | -2418.0 | -2426.0 | 24350.130 | 24350.130 |
| 4 | 2714724 | 167756 | -1383.0 | -1366.0 | 2165.040 | 2160.585 |
Grouping by SK_ID_CURR and SK_ID_PREV and aggregating by mean, so that we retain as much information as possible when merging or joining.
df_installment1=df_installment.groupby(['SK_ID_CURR','SK_ID_PREV']).mean()
df_installment1 = df_installment1.reset_index()
df_installment1.head()
| SK_ID_CURR | SK_ID_PREV | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | |
|---|---|---|---|---|---|---|
| 0 | 100001 | 1369693 | -1664.0 | -1679.500000 | 7312.725000 | 7312.725000 |
| 1 | 100001 | 1851984 | -2886.0 | -2882.333333 | 3981.675000 | 3981.675000 |
| 2 | 100002 | 1038818 | -295.0 | -315.421053 | 11559.247105 | 11559.247105 |
| 3 | 100003 | 1810518 | -626.0 | -630.428571 | 164425.332857 | 164425.332857 |
| 4 | 100003 | 2396755 | -2145.0 | -2151.750000 | 6731.115000 | 6731.115000 |
df_ccbal1['SK_ID_PREV'].value_counts()
1489396 1
1645533 1
2835518 1
2116495 1
2487050 1
..
2088402 1
2663364 1
2002588 1
1642103 1
1794451 1
Name: SK_ID_PREV, Length: 104307, dtype: int64
df_ccbal_installment1 = pd.merge(left=df_installment1, right=df_ccbal1, how='inner',
                                 on='SK_ID_PREV')
df_ccbal_installment1.head()
| SK_ID_CURR_x | SK_ID_PREV | DAYS_INSTALMENT | DAYS_ENTRY_PAYMENT | AMT_INSTALMENT | AMT_PAYMENT | SK_ID_CURR_y | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_PAYMENT_CURRENT | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100011 | 1843384 | -1139.157895 | -1142.434211 | 4723.027697 | 4723.027697 | 100011 | 54482.111149 | 164189.189189 | 2432.432432 | 0.000 | 0.000000 | 4843.064189 | 0.054054 | 0.000000 | 0.000000 | 25.767123 |
| 1 | 100013 | 2038692 | -1562.657895 | -1565.903509 | 6107.987368 | 6038.908421 | 100013 | 18159.919219 | 131718.750000 | 5953.125000 | 0.000 | 0.000000 | 7168.346250 | 0.255556 | 0.000000 | 0.000000 | 18.719101 |
| 2 | 100028 | 1914954 | -700.892857 | -702.559524 | 4215.151071 | 3377.672143 | 100028 | 8085.058163 | 225000.000000 | 6156.400408 | 0.000 | 6242.355000 | 6598.243256 | 0.045455 | 0.000000 | 2.613636 | 19.547619 |
| 3 | 100042 | 2137382 | -1474.926230 | -1479.500000 | 5205.409672 | 5131.639180 | 100042 | 33356.183036 | 78750.000000 | 5923.886786 | 2475.375 | 287.797500 | 7452.703929 | 0.226190 | 0.059524 | 0.095238 | 30.072289 |
| 4 | 100043 | 1557583 | -663.547619 | -666.785714 | 17736.775714 | 16476.312857 | 100043 | 208572.600000 | 287045.454545 | 14904.972273 | 0.000 | 12859.517727 | 21625.160625 | 0.060606 | 0.000000 | 0.303030 | 13.870968 |
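Merging on SK_ID_PREV alone leaves duplicated SK_ID_CURR_x / SK_ID_CURR_y columns, which then have to be dropped and renamed. Merging on both keys avoids the suffixes entirely; a sketch on toy frames:

```python
import pandas as pd

left = pd.DataFrame({'SK_ID_CURR': [1, 2], 'SK_ID_PREV': [10, 20], 'a': [0.1, 0.2]})
right = pd.DataFrame({'SK_ID_CURR': [1, 2], 'SK_ID_PREV': [10, 20], 'b': [7, 8]})

# Both key columns appear once in the result -- no _x/_y suffixes to clean up.
merged = pd.merge(left, right, how='inner', on=['SK_ID_CURR', 'SK_ID_PREV'])
print(merged.columns.tolist())  # ['SK_ID_CURR', 'SK_ID_PREV', 'a', 'b']
```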
We repeat this check even though each individual table has already been processed, because after merging some columns may become insignificant or highly correlated with each other.
msno.bar(df_ccbal_installment1)
<AxesSubplot:>
df_ccbal_installment1.to_csv('df_ccbal_installment.csv',index=False)
plt.figure(figsize = (20,20))
sns.heatmap(df_ccbal_installment1.corr(),annot=True, cmap='BuPu')
<AxesSubplot:>
df_ccbal_installment1 = df_ccbal_installment1.drop(['SK_ID_CURR_y','AMT_DRAWINGS_POS_CURRENT','AMT_DRAWINGS_CURRENT',
'AMT_PAYMENT','DAYS_ENTRY_PAYMENT'],axis=1)
df_ccbal_installment1.rename(columns={'SK_ID_CURR_x':'SK_ID_CURR'},inplace=True)
df_ccbal_installment1.head()
| SK_ID_CURR | SK_ID_PREV | DAYS_INSTALMENT | AMT_INSTALMENT | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_OTHER_CURRENT | AMT_PAYMENT_CURRENT | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100011 | 1843384 | -1139.157895 | 4723.027697 | 54482.111149 | 164189.189189 | 0.000 | 4843.064189 | 0.054054 | 0.000000 | 0.000000 | 25.767123 |
| 1 | 100013 | 2038692 | -1562.657895 | 6107.987368 | 18159.919219 | 131718.750000 | 0.000 | 7168.346250 | 0.255556 | 0.000000 | 0.000000 | 18.719101 |
| 2 | 100028 | 1914954 | -700.892857 | 4215.151071 | 8085.058163 | 225000.000000 | 0.000 | 6598.243256 | 0.045455 | 0.000000 | 2.613636 | 19.547619 |
| 3 | 100042 | 2137382 | -1474.926230 | 5205.409672 | 33356.183036 | 78750.000000 | 2475.375 | 7452.703929 | 0.226190 | 0.059524 | 0.095238 | 30.072289 |
| 4 | 100043 | 1557583 | -663.547619 | 17736.775714 | 208572.600000 | 287045.454545 | 0.000 | 21625.160625 | 0.060606 | 0.000000 | 0.303030 | 13.870968 |
df_prev_app = datasets['previous_application']
df_prev_app.head()
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | ... | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | ... | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | ... | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | ... | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | ... | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | ... | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 37 columns
Removing columns with high correlation or a large percentage of NULL values.
msno.bar(df_prev_app)
<AxesSubplot:>
df_prev_app = df_prev_app.drop(['RATE_INTEREST_PRIVILEGED','RATE_INTEREST_PRIMARY'],axis=1)
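The same high-missingness criterion can be applied programmatically instead of reading it off the msno bar chart. A sketch; the 0.7 cutoff is an assumption, not a value used in this notebook:

```python
import numpy as np
import pandas as pd

def drop_mostly_null(df: pd.DataFrame, cutoff: float = 0.7) -> pd.DataFrame:
    """Drop columns whose fraction of NaN values exceeds cutoff."""
    null_frac = df.isna().mean()
    return df.drop(columns=null_frac[null_frac > cutoff].index)

demo = pd.DataFrame({'keep': [1.0, 2.0, 3.0, 4.0],
                     'mostly_nan': [np.nan, np.nan, np.nan, 1.0]})  # 75% missing
print(drop_mostly_null(demo).columns.tolist())  # ['keep']
```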
plt.figure(figsize = (20,20))
sns.heatmap(df_prev_app.corr(),annot=True, cmap='BuPu')
<AxesSubplot:>
df_prev_app = df_prev_app.drop(['AMT_APPLICATION','AMT_CREDIT','AMT_GOODS_PRICE','DAYS_LAST_DUE'],axis=1)
plt.figure(figsize = (20,20))
sns.heatmap(df_prev_app.corr(),annot=True, cmap='BuPu')
<AxesSubplot:>
df_prev_app = df_prev_app.drop(['WEEKDAY_APPR_PROCESS_START','HOUR_APPR_PROCESS_START',
                                'NAME_SELLER_INDUSTRY','NAME_CASH_LOAN_PURPOSE',
                                'NAME_CONTRACT_STATUS', 'DAYS_DECISION', 'NAME_PAYMENT_TYPE',
                                'CODE_REJECT_REASON', 'NAME_TYPE_SUITE', 'NAME_CLIENT_TYPE',
                                'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE',
                                'CHANNEL_TYPE', 'SELLERPLACE_AREA',
                                'CNT_PAYMENT', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION','DAYS_LAST_DUE_1ST_VERSION',
                                'FLAG_LAST_APPL_PER_CONTRACT','NFLAG_INSURED_ON_APPROVAL',
                                'NFLAG_LAST_APPL_IN_DAY','NAME_CONTRACT_TYPE','DAYS_FIRST_DRAWING'],axis=1)
df_prev_app.head()
| SK_ID_PREV | SK_ID_CURR | AMT_ANNUITY | AMT_DOWN_PAYMENT | RATE_DOWN_PAYMENT | DAYS_FIRST_DUE | DAYS_TERMINATION | |
|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | 1730.430 | 0.0 | 0.0 | -42.0 | -37.0 |
| 1 | 2802425 | 108129 | 25188.615 | NaN | NaN | -134.0 | 365243.0 |
| 2 | 2523466 | 122040 | 15060.735 | NaN | NaN | -271.0 | 365243.0 |
| 3 | 2819243 | 176158 | 47041.335 | NaN | NaN | -482.0 | -177.0 |
| 4 | 1784265 | 202054 | 31924.395 | NaN | NaN | NaN | NaN |
df_prev_app = df_prev_app.drop(['AMT_DOWN_PAYMENT','RATE_DOWN_PAYMENT'],axis=1)
Grouping by SK_ID_CURR and SK_ID_PREV and aggregating by mean, so that we retain as much information as possible when merging or joining.
df_prev_app = df_prev_app.groupby(['SK_ID_PREV','SK_ID_CURR']).mean()
df_prev_app = df_prev_app.reset_index()
df_prev_app.head()
| SK_ID_PREV | SK_ID_CURR | AMT_ANNUITY | DAYS_FIRST_DUE | DAYS_TERMINATION | |
|---|---|---|---|---|---|
| 0 | 1000001 | 158271 | 6404.310 | -268.0 | -233.0 |
| 1 | 1000002 | 101962 | 6264.000 | -1600.0 | -1501.0 |
| 2 | 1000003 | 252457 | 4951.350 | -94.0 | 365243.0 |
| 3 | 1000004 | 260094 | 3391.110 | -862.0 | -672.0 |
| 4 | 1000005 | 176456 | 14713.605 | -1688.0 | -1415.0 |
df_prev_app.shape
(1670214, 5)
df_prev_ccbalinst = pd.merge(left=df_prev_app, right=df_ccbal_installment1, how='inner',
                             on=['SK_ID_PREV','SK_ID_CURR'])
df_prev_ccbalinst.shape
(62271, 15)
We repeat this check even though each individual table has already been processed, because after merging some columns may become insignificant or highly correlated with each other.
msno.bar(df_prev_ccbalinst)
<AxesSubplot:>
plt.figure(figsize = (20,20))
sns.heatmap(df_prev_ccbalinst.corr(),annot=True, cmap='BuPu')
<AxesSubplot:>
df_prev_ccbalinst = df_prev_ccbalinst.drop(['AMT_CREDIT_LIMIT_ACTUAL'],axis=1)
plt.figure(figsize = (20,20))
sns.heatmap(df_prev_ccbalinst.corr(),annot=True, cmap='BuPu')
<AxesSubplot:>
Left joining on SK_ID_CURR
df_train_prev_ccbalinstal = pd.merge(left=df_finalm_bureau_train, right=df_prev_ccbalinst,
                                     how='left', on='SK_ID_CURR')
df_train_prev_ccbalinstal.shape
(92288, 71)
We repeat this check even though each individual table has already been processed, because after merging some columns may become insignificant or highly correlated with each other.
msno.bar(df_train_prev_ccbalinstal)
<AxesSubplot:>
df_train_prev_ccbalinstal.to_csv('final_dataframe.csv',index= False)
df_final_merged = df_train_prev_ccbalinstal
df_final_merged.head()
| SK_ID_CURR | DAYS_CREDIT | DAYS_ENDDATE_FACT | AMT_CREDIT_SUM | DAYS_CREDIT_UPDATE | MONTHS_BALANCE | AMT_CREDIT_SUM_LIMIT | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | ... | DAYS_TERMINATION | DAYS_INSTALMENT | AMT_INSTALMENT | AMT_BALANCE | AMT_DRAWINGS_OTHER_CURRENT | AMT_PAYMENT_CURRENT | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | -1043.0 | -967.0 | 40761.0 | -758.0 | -26.0 | 0.0 | 1 | Cash loans | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 100010 | -1939.5 | -1138.0 | 495000.0 | -578.0 | -46.0 | 0.0 | 0 | Cash loans | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 100019 | -495.0 | NaN | 360000.0 | -26.5 | -8.0 | 135000.0 | 0 | Cash loans | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 100032 | -1169.5 | -662.0 | 331875.0 | -520.5 | -28.0 | 0.0 | 0 | Cash loans | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100033 | -195.0 | NaN | 675000.0 | -169.0 | -3.0 | 0.0 | 0 | Cash loans | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 71 columns
df_train_prev_ccbalinstal.head()
| SK_ID_CURR | DAYS_CREDIT | DAYS_ENDDATE_FACT | AMT_CREDIT_SUM | DAYS_CREDIT_UPDATE | MONTHS_BALANCE | AMT_CREDIT_SUM_LIMIT | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | ... | DAYS_TERMINATION | DAYS_INSTALMENT | AMT_INSTALMENT | AMT_BALANCE | AMT_DRAWINGS_OTHER_CURRENT | AMT_PAYMENT_CURRENT | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | -1043.0 | -967.0 | 40761.0 | -758.0 | -26.0 | 0.0 | 1 | Cash loans | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 100010 | -1939.5 | -1138.0 | 495000.0 | -578.0 | -46.0 | 0.0 | 0 | Cash loans | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 100019 | -495.0 | NaN | 360000.0 | -26.5 | -8.0 | 135000.0 | 0 | Cash loans | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 100032 | -1169.5 | -662.0 | 331875.0 | -520.5 | -28.0 | 0.0 | 0 | Cash loans | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100033 | -195.0 | NaN | 675000.0 | -169.0 | -3.0 | 0.0 | 0 | Cash loans | M | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 71 columns
Feature engineering is an important part of machine learning: we modify or create (i.e., engineer) new features from our existing dataset that might be meaningful in predicting TARGET.
Correlation of features with the target variable, in order to engineer new features.
df_final_merged.corrwith(df_final_merged["TARGET"]).sort_values(ascending = False)
TARGET                          1.000000
CNT_DRAWINGS_ATM_CURRENT        0.108122
DAYS_INSTALMENT                 0.082245
DAYS_CREDIT                     0.077823
DAYS_BIRTH                      0.076995
MONTHS_BALANCE                  0.069535
DAYS_CREDIT_UPDATE              0.060649
REGION_RATING_CLIENT_W_CITY     0.058259
REG_CITY_NOT_WORK_CITY          0.056437
DAYS_ENDDATE_FACT               0.052098
AMT_BALANCE                     0.050537
FLAG_EMP_PHONE                  0.049907
DAYS_ID_PUBLISH                 0.044734
DAYS_REGISTRATION               0.043915
OWN_CAR_AGE                     0.041009
CNT_DRAWINGS_POS_CURRENT        0.037029
DAYS_TERMINATION                0.034801
FLAG_DOCUMENT_3                 0.033429
DEF_60_CNT_SOCIAL_CIRCLE        0.028113
CNT_DRAWINGS_OTHER_CURRENT      0.027778
FLAG_WORK_PHONE                 0.024296
CNT_FAM_MEMBERS                 0.014143
AMT_REQ_CREDIT_BUREAU_YEAR      0.011893
OBS_60_CNT_SOCIAL_CIRCLE        0.007209
AMT_DRAWINGS_OTHER_CURRENT      0.006293
AMT_PAYMENT_CURRENT             0.001518
AMT_INSTALMENT                  0.000804
SK_ID_CURR                     -0.003079
AMT_CREDIT_SUM_LIMIT           -0.004759
AMT_ANNUITY_x                  -0.005442
AMT_REQ_CREDIT_BUREAU_MON      -0.005491
LANDAREA_AVG                   -0.005879
FLAG_DOCUMENT_13               -0.008082
FLAG_DOCUMENT_14               -0.008331
SK_ID_PREV                     -0.008585
DAYS_FIRST_DUE                 -0.008960
YEARS_BEGINEXPLUATATION_AVG    -0.010040
FLAG_DOCUMENT_16               -0.011845
ENTRANCES_AVG                  -0.013182
AMT_INCOME_TOTAL               -0.015043
AMT_ANNUITY_y                  -0.016477
HOUR_APPR_PROCESS_START        -0.020789
AMT_CREDIT_SUM                 -0.021473
AMT_CREDIT                     -0.026572
FLAG_PHONE                     -0.027116
BASEMENTAREA_AVG               -0.027292
APARTMENTS_AVG                 -0.029746
FLAG_DOCUMENT_6                -0.032099
REGION_POPULATION_RELATIVE     -0.034377
FLOORSMAX_AVG                  -0.041341
DAYS_EMPLOYED                  -0.048916
CNT_INSTALMENT_MATURE_CUM      -0.063045
EXT_SOURCE_2                   -0.136431
EXT_SOURCE_1                   -0.148456
EXT_SOURCE_3                   -0.175782
dtype: float64
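A ranking like this can be turned into a shortlist by keeping only features whose absolute correlation with the target clears a threshold. A sketch on toy data; the 0.05 cutoff is arbitrary, not one used in this notebook:

```python
import pandas as pd

def shortlist_features(df: pd.DataFrame, target: str = 'TARGET',
                       min_abs_corr: float = 0.05) -> list:
    """Columns ranked by |corr with target|, keeping those above the cutoff."""
    corr = df.corrwith(df[target]).abs().sort_values(ascending=False)
    corr = corr.drop(target)  # the target correlates 1.0 with itself
    return corr[corr >= min_abs_corr].index.tolist()

demo = pd.DataFrame({'TARGET': [0, 1, 0, 1, 1],
                     'strong': [0.1, 0.9, 0.2, 0.8, 0.7],
                     'weak':   [0.5, 0.5, 0.5, 0.5, 0.4]})
picked = shortlist_features(demo)
print(picked[0])  # 'strong' ranks first
```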
plt.figure(figsize = (40,40))
sns.heatmap(df_final_merged.corr(),annot=True, cmap='BuPu')
<AxesSubplot:>
Often, experts have domain knowledge about which combinations of existing features have strong explanatory/predictive power. In this case we construct the following features.
df_final_merged['DAYS_EMPLOYED_PCT'] = df_final_merged['DAYS_EMPLOYED'] / df_final_merged['DAYS_BIRTH']
Percentage of days employed: how long a person has been employed as a fraction of their life is a strong predictor of their ability to keep paying off loans.
We engineer this feature because total time employed affects a person's ability to repay credit, and because DAYS_EMPLOYED is correlated with the target variable.
df_final_merged['CREDIT_INCOME_PCT'] = df_final_merged['AMT_CREDIT'] / df_final_merged['AMT_INCOME_TOTAL']
df_final_merged['ANNUITY_INCOME_PCT'] = df_final_merged['AMT_ANNUITY_x'] / df_final_merged['AMT_INCOME_TOTAL']
Annuity as a percentage of income: an annuity is a more stable source of income, so the higher it is relative to total income, the less likely the client is to default.
We engineer this feature for that reason, and because it is correlated with the target variable.
df_final_merged['CREDIT_TERM_PCT'] = df_final_merged['AMT_ANNUITY_x'] / df_final_merged['AMT_CREDIT']
df_final_merged['AMT_BALANCE_PCT'] = df_final_merged['AMT_BALANCE'] / df_final_merged['DAYS_CREDIT']
df_final_merged['AVG_INCOME_EXT_PCT'] = (df_final_merged['EXT_SOURCE_1']+df_final_merged['EXT_SOURCE_2']
                                         +df_final_merged['EXT_SOURCE_3'])/3
df_final_merged['AVG_TOTALINCOME_PCT'] = df_final_merged['AMT_INCOME_TOTAL']/df_final_merged['AVG_INCOME_EXT_PCT']
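Ratio features like these produce inf wherever the denominator is zero, which breaks many downstream imputers and models. A guard worth adding (a sketch on toy data shaped like the columns above):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'AMT_ANNUITY_x': [24700.5, 9000.0],
                     'AMT_CREDIT':    [406597.5, 0.0]})  # second credit is zero

# Zero denominators yield inf; convert them to NaN so imputation can handle them.
demo['CREDIT_TERM_PCT'] = demo['AMT_ANNUITY_x'] / demo['AMT_CREDIT']
demo['CREDIT_TERM_PCT'] = demo['CREDIT_TERM_PCT'].replace([np.inf, -np.inf], np.nan)
print(demo['CREDIT_TERM_PCT'].isna().tolist())  # [False, True]
```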
df_final_merged.columns
Index(['SK_ID_CURR', 'DAYS_CREDIT', 'DAYS_ENDDATE_FACT', 'AMT_CREDIT_SUM',
'DAYS_CREDIT_UPDATE', 'MONTHS_BALANCE', 'AMT_CREDIT_SUM_LIMIT',
'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR',
'FLAG_OWN_REALTY', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY_x',
'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE',
'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE',
'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH',
'OWN_CAR_AGE', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_PHONE',
'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT_W_CITY',
'WEEKDAY_APPR_PROCESS_START', 'HOUR_APPR_PROCESS_START',
'REG_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE', 'EXT_SOURCE_1',
'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG',
'YEARS_BEGINEXPLUATATION_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG',
'LANDAREA_AVG', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE',
'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE', 'OBS_60_CNT_SOCIAL_CIRCLE',
'DEF_60_CNT_SOCIAL_CIRCLE', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_6',
'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_16',
'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'SK_ID_PREV',
'AMT_ANNUITY_y', 'DAYS_FIRST_DUE', 'DAYS_TERMINATION',
'DAYS_INSTALMENT', 'AMT_INSTALMENT', 'AMT_BALANCE',
'AMT_DRAWINGS_OTHER_CURRENT', 'AMT_PAYMENT_CURRENT',
'CNT_DRAWINGS_ATM_CURRENT', 'CNT_DRAWINGS_OTHER_CURRENT',
'CNT_DRAWINGS_POS_CURRENT', 'CNT_INSTALMENT_MATURE_CUM',
'DAYS_EMPLOYED_PCT', 'CREDIT_INCOME_PCT', 'ANNUITY_INCOME_PCT',
'CREDIT_TERM_PCT', 'AMT_BALANCE_PCT', 'AVG_INCOME_EXT_PCT',
'AVG_TOTALINCOME_PCT'],
dtype='object')
df_final_merged.corrwith(df_final_merged["TARGET"]).sort_values(ascending = False)
TARGET 1.000000
AVG_TOTALINCOME_PCT 0.123221
CNT_DRAWINGS_ATM_CURRENT 0.108122
DAYS_INSTALMENT 0.082245
DAYS_CREDIT 0.077823
...
AMT_BALANCE_PCT -0.066617
EXT_SOURCE_2 -0.136431
EXT_SOURCE_1 -0.148456
EXT_SOURCE_3 -0.175782
AVG_INCOME_EXT_PCT -0.225926
Length: 62, dtype: float64
plt.figure(figsize = (40,40))
sns.heatmap(df_final_merged.corr(),annot=True, cmap='BuPu')
<AxesSubplot:>
plot_df = df_final_merged[['TARGET','DAYS_EMPLOYED_PCT', 'CREDIT_INCOME_PCT', 'ANNUITY_INCOME_PCT',
'CREDIT_TERM_PCT', 'AMT_BALANCE_PCT', 'AVG_INCOME_EXT_PCT',
'AVG_TOTALINCOME_PCT']]
plot_df.head()
| TARGET | DAYS_EMPLOYED_PCT | CREDIT_INCOME_PCT | ANNUITY_INCOME_PCT | CREDIT_TERM_PCT | AMT_BALANCE_PCT | AVG_INCOME_EXT_PCT | AVG_TOTALINCOME_PCT | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.067329 | 2.007889 | 0.121978 | 0.060749 | NaN | 0.161787 | 1.251645e+06 |
| 1 | 0 | 0.023820 | 4.250000 | 0.116875 | 0.027500 | NaN | NaN | NaN |
| 2 | 0 | 0.132562 | 1.903314 | 0.128000 | 0.067251 | NaN | 0.380279 | 4.141700e+05 |
| 3 | 0 | 0.077376 | 2.906880 | 0.211800 | 0.072862 | NaN | NaN | NaN |
| 4 | 0 | 0.179708 | 2.929000 | 0.213617 | 0.072932 | NaN | 0.603301 | 4.475377e+05 |
feature_plot = pd.DataFrame(plot_df.corrwith(plot_df["TARGET"]).sort_values(ascending = False)).reset_index()
feature_plot = feature_plot.rename(columns={'index':'Column_name',0:'corr_with_target'})
plt.figure(figsize=(10,10))
sns.barplot(data=feature_plot,x=feature_plot['Column_name'][1:], y= feature_plot['corr_with_target'][1:], palette = 'BuPu')
plt.xticks(rotation=90)
plt.title('Correlation of Features with Target Variable', fontweight = 'bold', fontsize = 16)
plt.show()
df_final_merged.head()
| SK_ID_CURR | DAYS_CREDIT | DAYS_ENDDATE_FACT | AMT_CREDIT_SUM | DAYS_CREDIT_UPDATE | MONTHS_BALANCE | AMT_CREDIT_SUM_LIMIT | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | ... | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | DAYS_EMPLOYED_PCT | CREDIT_INCOME_PCT | ANNUITY_INCOME_PCT | CREDIT_TERM_PCT | AMT_BALANCE_PCT | AVG_INCOME_EXT_PCT | AVG_TOTALINCOME_PCT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | -1043.0 | -967.0 | 40761.0 | -758.0 | -26.0 | 0.0 | 1 | Cash loans | M | ... | NaN | NaN | NaN | 0.067329 | 2.007889 | 0.121978 | 0.060749 | NaN | 0.161787 | 1.251645e+06 |
| 1 | 100010 | -1939.5 | -1138.0 | 495000.0 | -578.0 | -46.0 | 0.0 | 0 | Cash loans | M | ... | NaN | NaN | NaN | 0.023820 | 4.250000 | 0.116875 | 0.027500 | NaN | NaN | NaN |
| 2 | 100019 | -495.0 | NaN | 360000.0 | -26.5 | -8.0 | 135000.0 | 0 | Cash loans | M | ... | NaN | NaN | NaN | 0.132562 | 1.903314 | 0.128000 | 0.067251 | NaN | 0.380279 | 4.141700e+05 |
| 3 | 100032 | -1169.5 | -662.0 | 331875.0 | -520.5 | -28.0 | 0.0 | 0 | Cash loans | M | ... | NaN | NaN | NaN | 0.077376 | 2.906880 | 0.211800 | 0.072862 | NaN | NaN | NaN |
| 4 | 100033 | -195.0 | NaN | 675000.0 | -169.0 | -3.0 | 0.0 | 0 | Cash loans | M | ... | NaN | NaN | NaN | 0.179708 | 2.929000 | 0.213617 | 0.072932 | NaN | 0.603301 | 4.475377e+05 |
5 rows × 78 columns
X_merged = df_final_merged.drop(['TARGET'],axis=1)
y_merged = df_final_merged['TARGET']
y_merged = pd.DataFrame(y_merged, columns = ["TARGET"])
y_merged
| TARGET | |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| ... | ... |
| 92283 | 0 |
| 92284 | 0 |
| 92285 | 0 |
| 92286 | 1 |
| 92287 | 0 |
92288 rows × 1 columns
X_train_merged,X_test_merged,y_train_merged,y_test_merged = train_test_split(X_merged, y_merged, test_size=0.2,random_state=42)
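The split above is random but unstratified; with a heavily imbalanced target (roughly 8% positives here), passing `stratify` keeps the positive rate identical across the splits. A minimal sketch on synthetic labels (the variable names are illustrative, not from the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: ~8% positives, mirroring the HCDR target rate
rng = np.random.RandomState(42)
y = (rng.rand(10_000) < 0.08).astype(int)
X = rng.rand(10_000, 3)

# stratify=y preserves the positive rate in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

With stratification the two printed rates match the overall rate to within rounding, so accuracy and AUC estimates are not distorted by an unlucky split.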
integer_df_merged= X_merged.select_dtypes(include='int64')
float_df_merged = X_merged.select_dtypes(include='float64')
numerical_df_merged = list(pd.concat([integer_df_merged,float_df_merged], axis=1))
categorical_df_merged = list(X_merged.select_dtypes(include='object'))
print(categorical_df_merged)
print(' ')
print(numerical_df_merged)
['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE']
['SK_ID_CURR', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_PHONE', 'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START', 'REG_CITY_NOT_WORK_CITY', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_13', 'FLAG_DOCUMENT_14', 'FLAG_DOCUMENT_16', 'DAYS_CREDIT', 'DAYS_ENDDATE_FACT', 'AMT_CREDIT_SUM', 'DAYS_CREDIT_UPDATE', 'MONTHS_BALANCE', 'AMT_CREDIT_SUM_LIMIT', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY_x', 'REGION_POPULATION_RELATIVE', 'DAYS_REGISTRATION', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'LANDAREA_AVG', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_YEAR', 'SK_ID_PREV', 'AMT_ANNUITY_y', 'DAYS_FIRST_DUE', 'DAYS_TERMINATION', 'DAYS_INSTALMENT', 'AMT_INSTALMENT', 'AMT_BALANCE', 'AMT_DRAWINGS_OTHER_CURRENT', 'AMT_PAYMENT_CURRENT', 'CNT_DRAWINGS_ATM_CURRENT', 'CNT_DRAWINGS_OTHER_CURRENT', 'CNT_DRAWINGS_POS_CURRENT', 'CNT_INSTALMENT_MATURE_CUM', 'DAYS_EMPLOYED_PCT', 'CREDIT_INCOME_PCT', 'ANNUITY_INCOME_PCT', 'CREDIT_TERM_PCT', 'AMT_BALANCE_PCT', 'AVG_INCOME_EXT_PCT', 'AVG_TOTALINCOME_PCT']
This part of the project focuses on constructing the pipeline. Since the data has both numerical and categorical features, two pipelines are needed (one per feature type) because each requires different transformations. The two pipelines are then unified with a ColumnTransformer into one full pipeline that transforms the entire dataset. After that, we perform hyperparameter tuning with GridSearchCV, trying different parameter settings while training the model. We used Logistic Regression and Random Forest algorithms.
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # impute before scaling: StandardScaler cannot handle NaNs
    ('scaler', StandardScaler())
])
cat_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))
])
data_pipeline_merged = ColumnTransformer([
("num_pipeline", num_pipeline, numerical_df_merged),
("cat_pipeline", cat_pipeline, categorical_df_merged)])
model = Pipeline(steps = [('data_pipeline_merge',data_pipeline_merged),
('classifier', LogisticRegression(solver='lbfgs',penalty='l2'))])
from IPython.display import Image
Image(url= "https://prwatech.in/blog/wp-content/uploads/2020/02/logi1.png", width=900, height=900)
Logistic regression (LR) is a statistical method similar to linear regression in that LR finds an equation that predicts an outcome for a binary variable, Y, from one or more predictor variables, X. However, unlike linear regression, the predictors can be categorical or continuous, as the model does not strictly require continuous data. To predict group membership, LR uses the log odds ratio rather than raw probabilities, and an iterative maximum-likelihood method rather than least squares to fit the final model. This gives the researcher more freedom, and the method may be more appropriate for non-normally distributed data or when the samples have unequal covariance matrices. Logistic regression assumes independence among observations, an assumption that is not always met in real datasets.
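The log-odds relationship described above can be made concrete in a few lines; the coefficients below are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    # Maps the linear score (log-odds) to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted coefficients: intercept b0 and one weight b1
b0, b1 = -2.0, 0.5
x = 3.0                      # a single predictor value
log_odds = b0 + b1 * x       # the model is linear in the predictors
p = sigmoid(log_odds)        # P(Y = 1 | x)

# Inverse check: log(p / (1 - p)) recovers the log-odds
assert np.isclose(np.log(p / (1 - p)), log_odds)
print(round(p, 4))  # → 0.3775
```

Classification then follows by thresholding p (typically at 0.5), which is exactly what the fitted pipeline's `predict` does on top of `predict_proba`.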
from IPython.display import Image
Image(url= "https://www.researchgate.net/publication/327512672/figure/fig2/AS:668286433914884@1536343310243/Basic-sigmoid-function-Note-A-Basic-sigmoid-function-with-two-parameters-c-1-and-c-2.jpg", width=900, height=900)
Prediction probability of Class 0 and Class 1
param_grid = {
'classifier__C': [0.1,1,10,100]
}
gs = GridSearchCV(model, param_grid, cv=5, n_jobs = -1, verbose = 2, refit = True)
start = time()
gs.fit(X_train_merged, y_train_merged)
train_time = np.round(time() - start, 4)  # measure fit time before start is reused below
trainAcc = gs.score(X_train_merged, y_train_merged)
validAcc = gs.score(X_test_merged, y_test_merged)
start = time()
testAcc = gs.score(X_test_merged, y_test_merged)
test_time = np.round(time() - start, 4)
#del experimentLog
try: experimentLog
except NameError: experimentLog = pd.DataFrame(columns=["Pipeline", "Dataset", "TrainAcc", "ValidAcc", "TestAcc",
                                                        "Train Time(s)", "Test Time(s)", "Description"])
experimentLog.loc[len(experimentLog)] =[f"GridSearchCV LogReg", "HCDR",
f"{trainAcc*100:8.2f}%", f"{validAcc*100:8.2f}%", f"{testAcc*100:8.2f}%",
train_time, test_time,
"GridSearchCV LogReg pipeline with Cat+Num features"]
experimentLog
Fitting 5 folds for each of 4 candidates, totalling 20 fits
| Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | Train Time(s) | Test Time(s) | Description | |
|---|---|---|---|---|---|---|---|---|
| 0 | Baseline 1 LogReg | HCDR | 91.91% | 91.94% | 91.94% | 17.1005 | 0.4038 | Baseline 1 LogReg pipeline with Cat+Num features |
| 1 | Baseline 1 RandomForest | HCDR | 91.91% | 91.95% | 91.95% | 10.7311 | 0.4899 | Baseline 1 RandomForest pipeline with Cat+Num ... |
| 2 | GridSearchCV LogReg | HCDR | 91.79% | 91.99% | 91.99% | 10.7311 | 0.2651 | GridSearchCV LogReg pipeline with Cat+Num feat... |
y_pred_merged = gs.predict(X_test_merged)
res_merged = gs.predict_proba(X_train_merged)
res_merged
array([[0.98358155, 0.01641845],
[0.98011809, 0.01988191],
[0.97328193, 0.02671807],
...,
[0.96764864, 0.03235136],
[0.99244162, 0.00755838],
[0.96137727, 0.03862273]])
Prediction probability of Class 0 and Class 1
print('Validation set accuracy score: ' + str(accuracy_score(y_test_merged,y_pred_merged)))
Validation set accuracy score: 0.9198721421605808
Image(url= "https://miro.medium.com/max/1192/1*wilGXrItaMAJmZNl6RJq9Q.png", width=900, height=900)
log_loss(y_test_merged,y_pred_merged)
2.7675196811129577
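One caveat about the cell above: `log_loss` is defined on predicted probabilities, and feeding it hard 0/1 labels clips them to near-certainty, which inflates the reported value (hence ~2.77 here, versus a typical sub-1 loss on probabilities). A small illustration with made-up probabilities:

```python
from sklearn.metrics import log_loss

y_true = [0, 0, 1, 1]
proba = [0.1, 0.6, 0.7, 0.9]   # hypothetical P(class 1); one borderline miss
hard = [0, 1, 1, 1]            # the same model after thresholding at 0.5

loss_proba = log_loss(y_true, proba)
loss_hard = log_loss(y_true, hard)   # 0/1 inputs get clipped to near-certainty

# Thresholding discards calibration, so the hard-label loss explodes
print(round(loss_proba, 4), loss_hard > loss_proba)  # → 0.3709 True
```

For this notebook the analogous fix would be `log_loss(y_test_merged, gs.predict_proba(X_test_merged))`.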
roc_auc_score(y_test_merged, gs.predict_proba(X_test_merged)[:, 1])
0.738818586860982
metrics.plot_roc_curve(gs, X_test_merged, y_test_merged)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x7f42740f2a00>
from sklearn.metrics import accuracy_score, recall_score, plot_roc_curve, confusion_matrix, classification_report, precision_recall_curve, auc
cf_matrix = confusion_matrix(y_test_merged, y_pred_merged)
sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True,
fmt='.2%', cmap='Purples')
plt.title('Confusion Matrix of Logistic Regression Classifier', fontweight='bold', fontsize=16)
plt.xlabel('Predicted', fontweight='bold', fontsize=12)
plt.ylabel('Actual', fontweight='bold', fontsize=12)
Text(33.0, 0.5, 'Actual')
Random Forest works in two phases: the first builds the forest by combining N decision trees, and the second makes a prediction by aggregating the outputs of the trees built in the first phase.
The working process can be explained in the steps and diagrams below:
Step 1: Choose the number N of decision trees to build.
Step 2: Draw a random sample of K data points (with replacement) from the training set.
Step 3: Build a decision tree on the selected subset.
Step 4: Repeat Steps 2 and 3 until N trees have been built.
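The steps above amount to bagging: each tree is fit on a bootstrap sample, and the forest averages the per-tree class probabilities. A minimal sketch on synthetic data (not the HCDR features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the HCDR feature matrix
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# N = 25 trees, each fit on a bootstrap sample of the training rows
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# The forest probability is the mean of the individual trees' probabilities
tree_mean = np.mean(
    [t.predict_proba(X[:5])[:, 1] for t in forest.estimators_], axis=0)
forest_proba = forest.predict_proba(X[:5])[:, 1]
print(np.allclose(tree_mean, forest_proba))  # prints True
```

This averaging is why a forest of fully grown (high-variance) trees generalizes better than any single tree, which is visible in the experiment log: 100% train accuracy but ~92% on held-out data.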
Image(url= "https://cdn.corporatefinanceinstitute.com/assets/random-forest.png", width=900, height=900)
Image(url= "https://miro.medium.com/max/1400/1*cBaV_bLVW-gT6PkG5ve26A.png", width=900, height=900)
data_pipeline_merged = ColumnTransformer([
("num_pipeline", num_pipeline, numerical_df_merged),
("cat_pipeline", cat_pipeline, categorical_df_merged)])
model_rf = Pipeline(steps=[('data_pipeline_merged',data_pipeline_merged),
('classifier', RandomForestClassifier())])
param_grid = {
'classifier__n_estimators': [100,200],
# 'classifier__max_depth':[4,5,6,7],
'classifier__criterion':['gini','entropy']
}
gs_rf = GridSearchCV(model_rf,param_grid, cv=5, n_jobs = -1, verbose = 2, refit = True)
start = time()
gs_rf.fit(X_train_merged, y_train_merged)
train_time = np.round(time() - start, 4)  # measure fit time before start is reused below
trainAcc = gs_rf.score(X_train_merged, y_train_merged)
validAcc = gs_rf.score(X_test_merged, y_test_merged)
start = time()
testAcc = gs_rf.score(X_test_merged, y_test_merged)
test_time = np.round(time() - start, 4)
#del experimentLog
try: experimentLog
except NameError: experimentLog = pd.DataFrame(columns=["Pipeline", "Dataset", "TrainAcc", "ValidAcc", "TestAcc",
                                                        "Train Time(s)", "Test Time(s)", "Description"])
experimentLog.loc[len(experimentLog)] =[f"GridSearchCV RandomForestClassifier", "HCDR",
f"{trainAcc*100:8.2f}%", f"{validAcc*100:8.2f}%", f"{testAcc*100:8.2f}%",
train_time, test_time,
"GridSearchCV RandomForestClassifier pipeline with Cat+Num features"]
experimentLog
Fitting 5 folds for each of 4 candidates, totalling 20 fits
| Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | Train Time(s) | Test Time(s) | Description | |
|---|---|---|---|---|---|---|---|---|
| 0 | Baseline 1 LogReg | HCDR | 91.91% | 91.94% | 91.94% | 17.1005 | 0.4038 | Baseline 1 LogReg pipeline with Cat+Num features |
| 1 | Baseline 1 RandomForest | HCDR | 91.91% | 91.95% | 91.95% | 10.7311 | 0.4899 | Baseline 1 RandomForest pipeline with Cat+Num ... |
| 2 | GridSearchCV LogReg | HCDR | 91.79% | 91.99% | 91.99% | 10.7311 | 0.2651 | GridSearchCV LogReg pipeline with Cat+Num feat... |
| 3 | GridSearchCV RandomForestClassifier | HCDR | 100.00% | 92.09% | 92.09% | 10.7311 | 1.2215 | GridSearchCV RandomForestClassifier pipeline w... |
clf_pipe_merged_rf = make_pipeline(data_pipeline_merged, RandomForestClassifier(n_estimators = 100, random_state = 50, verbose = 1, n_jobs = -1))
# clf_pipe.fit(X_train, y_train)
# train_acc = clf_pipe.score(X_train, y_train)
# # validAcc = clf_pipe.score(X_valid, y_valid)
# testAcc = clf_pipe.score(X_test, y_test)
# print(train_acc,testAcc)
start = time()
clf_pipe_merged_rf.fit(X_train_merged, y_train_merged)
train_time = np.round(time() - start, 4)  # measure fit time before start is reused below
trainAcc = clf_pipe_merged_rf.score(X_train_merged, y_train_merged)
validAcc = clf_pipe_merged_rf.score(X_test_merged, y_test_merged)
start = time()
testAcc = clf_pipe_merged_rf.score(X_test_merged, y_test_merged)
test_time = np.round(time() - start, 4)
#del experimentLog
try: experimentLog
except NameError: experimentLog = pd.DataFrame(columns=["Pipeline", "Dataset", "TrainAcc", "ValidAcc", "TestAcc",
                                                        "Train Time(s)", "Test Time(s)", "Description"])
experimentLog.loc[len(experimentLog)] =[f"GridSearchCV RandomForestClassifier", "HCDR",
f"{trainAcc*100:8.2f}%", f"{validAcc*100:8.2f}%", f"{testAcc*100:8.2f}%",
train_time, test_time,
"GridSearchCV RandomForestClassifier pipeline with Cat+Num features"]
experimentLog
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 64 concurrent workers. [Parallel(n_jobs=-1)]: Done 74 out of 100 | elapsed: 1.9s remaining: 0.7s [Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 2.1s finished [Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers. [Parallel(n_jobs=64)]: Done 74 out of 100 | elapsed: 0.2s remaining: 0.1s [Parallel(n_jobs=64)]: Done 100 out of 100 | elapsed: 0.2s finished [Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers. [Parallel(n_jobs=64)]: Done 74 out of 100 | elapsed: 0.1s remaining: 0.0s [Parallel(n_jobs=64)]: Done 100 out of 100 | elapsed: 0.1s finished [Parallel(n_jobs=64)]: Using backend ThreadingBackend with 64 concurrent workers. [Parallel(n_jobs=64)]: Done 74 out of 100 | elapsed: 0.1s remaining: 0.0s [Parallel(n_jobs=64)]: Done 100 out of 100 | elapsed: 0.1s finished
| Pipeline | Dataset | TrainAcc | ValidAcc | TestAcc | Train Time(s) | Test Time(s) | Description | |
|---|---|---|---|---|---|---|---|---|
| 0 | Baseline 1 LogReg | HCDR | 91.91% | 91.94% | 91.94% | 17.1005 | 0.4038 | Baseline 1 LogReg pipeline with Cat+Num features |
| 1 | Baseline 1 RandomForest | HCDR | 91.91% | 91.95% | 91.95% | 10.7311 | 0.4899 | Baseline 1 RandomForest pipeline with Cat+Num ... |
| 2 | GridSearchCV LogReg | HCDR | 91.79% | 91.99% | 91.99% | 10.7311 | 0.2651 | GridSearchCV LogReg pipeline with Cat+Num feat... |
| 3 | GridSearchCV RandomForestClassifier | HCDR | 100.00% | 92.09% | 92.09% | 10.7311 | 1.2215 | GridSearchCV RandomForestClassifier pipeline w... |
| 4 | GridSearchCV RandomForestClassifier | HCDR | 100.00% | 92.09% | 92.09% | 10.7311 | 0.2967 | GridSearchCV RandomForestClassifier pipeline w... |
y_pred_merged_rf = gs_rf.predict(X_test_merged)
res_merged_rf = gs_rf.predict_proba(X_train_merged)
res_merged_rf
array([[0.99 , 0.01 ],
[0.99 , 0.01 ],
[0.975, 0.025],
...,
[0.97 , 0.03 ],
[0.995, 0.005],
[0.99 , 0.01 ]])
print('Validation set accuracy score: ' + str(accuracy_score(y_test_merged,y_pred_merged_rf)))
Validation set accuracy score: 0.9209015061220067
log_loss(y_test_merged,y_pred_merged_rf)
2.7319651932262223
roc_auc_score(y_test_merged, gs_rf.predict_proba(X_test_merged)[:, 1])
0.7153041501828019
metrics.plot_roc_curve(gs_rf, X_test_merged, y_test_merged)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x7f42de792f70>
from sklearn.metrics import accuracy_score, recall_score, plot_roc_curve, confusion_matrix, classification_report, precision_recall_curve, auc
cf_matrix = confusion_matrix(y_test_merged, y_pred_merged_rf)
sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True,
fmt='.2%', cmap='Purples')
plt.title('Confusion Matrix of Random Forest Classifier', fontweight='bold', fontsize=16)
plt.xlabel('Predicted', fontweight='bold', fontsize=12)
plt.ylabel('Actual', fontweight='bold', fontsize=12)
Text(33.0, 0.5, 'Actual')
# Drop the same sparsely populated / low-signal columns from the test set
# as were dropped from the training set (one duplicate 'YEARS_BUILD_MEDI' entry removed)
cols_to_drop = [
    'REG_REGION_NOT_WORK_REGION', 'REG_REGION_NOT_LIVE_REGION',
    'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_DAY',
    'FLAG_EMAIL', 'AMT_REQ_CREDIT_BUREAU_QRT', 'FLAG_DOCUMENT_4',
    'FLAG_DOCUMENT_17', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_9',
    'FLAG_DOCUMENT_15', 'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_8',
    'TOTALAREA_MODE', 'ELEVATORS_AVG', 'LIVINGAPARTMENTS_AVG',
    'LIVINGAREA_AVG', 'APARTMENTS_MODE', 'ELEVATORS_MODE',
    'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'APARTMENTS_MEDI',
    'ELEVATORS_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI',
    'NONLIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'FLOORSMIN_MEDI',
    'FLOORSMAX_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE',
    'BASEMENTAREA_MEDI', 'BASEMENTAREA_MODE', 'YEARS_BUILD_MEDI',
    'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'YEARS_BUILD_MODE',
    'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BEGINEXPLUATATION_MODE',
    'COMMONAREA_MEDI', 'COMMONAREA_MODE', 'ENTRANCES_MODE', 'LANDAREA_MODE',
    'LANDAREA_MEDI', 'DEF_30_CNT_SOCIAL_CIRCLE', 'OBS_30_CNT_SOCIAL_CIRCLE',
    'REG_CITY_NOT_LIVE_CITY', 'REGION_RATING_CLIENT', 'YEARS_BUILD_AVG',
    'COMMONAREA_AVG', 'FLOORSMIN_AVG', 'DAYS_LAST_PHONE_CHANGE',
    'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'LIVE_CITY_NOT_WORK_CITY',
    'LIVE_REGION_NOT_WORK_REGION', 'AMT_GOODS_PRICE', 'CNT_CHILDREN',
    'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'FLAG_MOBIL',
    'FLAG_CONT_MOBILE', 'FLAG_DOCUMENT_20', 'FLAG_DOCUMENT_5',
    'FLAG_DOCUMENT_12', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_10',
    'FLAG_DOCUMENT_7']
datasets['application_test'] = datasets['application_test'].drop(cols_to_drop, axis=1)
percent = (datasets["application_test"].isnull().sum()/datasets["application_test"].isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = datasets["application_test"].isna().sum().sort_values(ascending = False)
missing_application_test_data = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Test Missing Count"])
missing_application_test_data.head(20)
| Percent | Test Missing Count | |
|---|---|---|
| FONDKAPREMONT_MODE | 67.28 | 32797 |
| OWN_CAR_AGE | 66.29 | 32312 |
| LANDAREA_AVG | 57.96 | 28254 |
| BASEMENTAREA_AVG | 56.71 | 27641 |
| WALLSMATERIAL_MODE | 49.02 | 23893 |
| APARTMENTS_AVG | 49.01 | 23887 |
| HOUSETYPE_MODE | 48.46 | 23619 |
| ENTRANCES_AVG | 48.37 | 23579 |
| FLOORSMAX_AVG | 47.84 | 23321 |
| YEARS_BEGINEXPLUATATION_AVG | 46.89 | 22856 |
| EMERGENCYSTATE_MODE | 45.56 | 22209 |
| EXT_SOURCE_1 | 42.12 | 20532 |
| OCCUPATION_TYPE | 32.01 | 15605 |
| EXT_SOURCE_3 | 17.78 | 8668 |
| AMT_REQ_CREDIT_BUREAU_MON | 12.41 | 6049 |
| AMT_REQ_CREDIT_BUREAU_YEAR | 12.41 | 6049 |
| NAME_TYPE_SUITE | 1.87 | 911 |
| OBS_60_CNT_SOCIAL_CIRCLE | 0.06 | 29 |
| DEF_60_CNT_SOCIAL_CIRCLE | 0.06 | 29 |
| AMT_ANNUITY | 0.05 | 24 |
datasets['application_test'].shape
(48744, 51)
datasets['application_test'].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 48744 entries, 0 to 48743 Data columns (total 51 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SK_ID_CURR 48744 non-null int64 1 NAME_CONTRACT_TYPE 48744 non-null object 2 CODE_GENDER 48744 non-null object 3 FLAG_OWN_CAR 48744 non-null object 4 FLAG_OWN_REALTY 48744 non-null object 5 AMT_INCOME_TOTAL 48744 non-null float64 6 AMT_CREDIT 48744 non-null float64 7 AMT_ANNUITY 48720 non-null float64 8 NAME_TYPE_SUITE 47833 non-null object 9 NAME_INCOME_TYPE 48744 non-null object 10 NAME_EDUCATION_TYPE 48744 non-null object 11 NAME_FAMILY_STATUS 48744 non-null object 12 NAME_HOUSING_TYPE 48744 non-null object 13 REGION_POPULATION_RELATIVE 48744 non-null float64 14 DAYS_BIRTH 48744 non-null int64 15 DAYS_EMPLOYED 48744 non-null int64 16 DAYS_REGISTRATION 48744 non-null float64 17 DAYS_ID_PUBLISH 48744 non-null int64 18 OWN_CAR_AGE 16432 non-null float64 19 FLAG_EMP_PHONE 48744 non-null int64 20 FLAG_WORK_PHONE 48744 non-null int64 21 FLAG_PHONE 48744 non-null int64 22 OCCUPATION_TYPE 33139 non-null object 23 CNT_FAM_MEMBERS 48744 non-null float64 24 REGION_RATING_CLIENT_W_CITY 48744 non-null int64 25 WEEKDAY_APPR_PROCESS_START 48744 non-null object 26 HOUR_APPR_PROCESS_START 48744 non-null int64 27 REG_CITY_NOT_WORK_CITY 48744 non-null int64 28 ORGANIZATION_TYPE 48744 non-null object 29 EXT_SOURCE_1 28212 non-null float64 30 EXT_SOURCE_2 48736 non-null float64 31 EXT_SOURCE_3 40076 non-null float64 32 APARTMENTS_AVG 24857 non-null float64 33 BASEMENTAREA_AVG 21103 non-null float64 34 YEARS_BEGINEXPLUATATION_AVG 25888 non-null float64 35 ENTRANCES_AVG 25165 non-null float64 36 FLOORSMAX_AVG 25423 non-null float64 37 LANDAREA_AVG 20490 non-null float64 38 FONDKAPREMONT_MODE 15947 non-null object 39 HOUSETYPE_MODE 25125 non-null object 40 WALLSMATERIAL_MODE 24851 non-null object 41 EMERGENCYSTATE_MODE 26535 non-null object 42 OBS_60_CNT_SOCIAL_CIRCLE 48715 non-null float64 43 DEF_60_CNT_SOCIAL_CIRCLE 48715 non-null float64 44 FLAG_DOCUMENT_3 48744 non-null int64 45 FLAG_DOCUMENT_6 48744 non-null int64 46 FLAG_DOCUMENT_13 48744 non-null int64 47 FLAG_DOCUMENT_14 48744 non-null int64 48 FLAG_DOCUMENT_16 48744 non-null int64 49 AMT_REQ_CREDIT_BUREAU_MON 42695 non-null float64 50 AMT_REQ_CREDIT_BUREAU_YEAR 42695 non-null float64 dtypes: float64(20), int64(15), object(16) memory usage: 19.0+ MB
datasets['application_test'].describe()
| SK_ID_CURR | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | ... | LANDAREA_AVG | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_16 | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48744.000000 | 4.874400e+04 | 4.874400e+04 | 48720.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 48744.000000 | 16432.000000 | ... | 20490.000000 | 48715.000000 | 48715.000000 | 48744.000000 | 48744.000000 | 48744.0 | 48744.0 | 48744.0 | 42695.000000 | 42695.000000 |
| mean | 277796.676350 | 1.784318e+05 | 5.167404e+05 | 29426.240209 | 0.021226 | -16068.084605 | 67485.366322 | -4967.652716 | -3051.712949 | 11.786027 | ... | 0.067192 | 1.435738 | 0.101139 | 0.786620 | 0.087477 | 0.0 | 0.0 | 0.0 | 0.009299 | 1.983769 |
| std | 103169.547296 | 1.015226e+05 | 3.653970e+05 | 16016.368315 | 0.014428 | 4325.900393 | 144348.507136 | 3552.612035 | 1569.276709 | 11.462889 | ... | 0.081909 | 3.580125 | 0.403791 | 0.409698 | 0.282536 | 0.0 | 0.0 | 0.0 | 0.110924 | 1.838873 |
| min | 100001.000000 | 2.694150e+04 | 4.500000e+04 | 2295.000000 | 0.000253 | -25195.000000 | -17463.000000 | -23722.000000 | -6348.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 25% | 188557.750000 | 1.125000e+05 | 2.606400e+05 | 17973.000000 | 0.010006 | -19637.000000 | -2910.000000 | -7459.250000 | -4448.000000 | 4.000000 | ... | 0.019000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 50% | 277549.000000 | 1.575000e+05 | 4.500000e+05 | 26199.000000 | 0.018850 | -15785.000000 | -1293.000000 | -4490.000000 | -3234.000000 | 9.000000 | ... | 0.048300 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 2.000000 |
| 75% | 367555.500000 | 2.250000e+05 | 6.750000e+05 | 37390.500000 | 0.028663 | -12496.000000 | -296.000000 | -1901.000000 | -1706.000000 | 15.000000 | ... | 0.086800 | 2.000000 | 0.000000 | 1.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 | 3.000000 |
| max | 456250.000000 | 4.410000e+06 | 2.245500e+06 | 180576.000000 | 0.072508 | -7338.000000 | 365243.000000 | 0.000000 | 0.000000 | 74.000000 | ... | 1.000000 | 351.000000 | 24.000000 | 1.000000 | 1.000000 | 0.0 | 0.0 | 0.0 | 6.000000 | 17.000000 |
8 rows × 35 columns
df_finalm_bureau_test = pd.merge(left = datasets['application_test'], right = df_bureau_final,
how='left', left_on='SK_ID_CURR', right_on='SK_ID_CURR')
df_finalm_bureau_test.shape
(48744, 57)
msno.bar(df_finalm_bureau_test)
<AxesSubplot:>
df_final_test = pd.merge(left=df_finalm_bureau_test, right=df_prev_ccbalinst, how='left',
left_on='SK_ID_CURR', right_on='SK_ID_CURR')
df_final_test.shape
(48797, 70)
df_final_test['DAYS_EMPLOYED_PCT'] = df_final_test['DAYS_EMPLOYED'] / df_final_test['DAYS_BIRTH']
df_final_test['CREDIT_INCOME_PCT'] = df_final_test['AMT_CREDIT'] / df_final_test['AMT_INCOME_TOTAL']
df_final_test['ANNUITY_INCOME_PCT'] = df_final_test['AMT_ANNUITY_x'] / df_final_test['AMT_INCOME_TOTAL']
df_final_test['CREDIT_TERM_PCT'] = df_final_test['AMT_ANNUITY_x'] / df_final_test['AMT_CREDIT']
df_final_test['AMT_BALANCE_PCT'] = df_final_test['AMT_BALANCE'] / df_final_test['DAYS_CREDIT']
df_final_test['AVG_INCOME_EXT_PCT'] = (df_final_test['EXT_SOURCE_1']+df_final_test['EXT_SOURCE_2']
+df_final_test['EXT_SOURCE_3'])/3
df_final_test['AVG_TOTALINCOME_PCT'] = df_final_test['AMT_INCOME_TOTAL']/df_final_test['AVG_INCOME_EXT_PCT']
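Since the same ratio features are built for the train and test frames in separate cells, wrapping them in one helper avoids copy-paste drift between the two; a sketch (the function name and demo frame are ours, covering a subset of the ratios):

```python
import pandas as pd

def add_ratio_features(df: pd.DataFrame) -> pd.DataFrame:
    # Same derived ratios as above, applied to any frame with these columns
    out = df.copy()
    out['DAYS_EMPLOYED_PCT'] = out['DAYS_EMPLOYED'] / out['DAYS_BIRTH']
    out['CREDIT_INCOME_PCT'] = out['AMT_CREDIT'] / out['AMT_INCOME_TOTAL']
    out['ANNUITY_INCOME_PCT'] = out['AMT_ANNUITY_x'] / out['AMT_INCOME_TOTAL']
    out['CREDIT_TERM_PCT'] = out['AMT_ANNUITY_x'] / out['AMT_CREDIT']
    return out

# Tiny illustrative frame (values invented)
demo = pd.DataFrame({'DAYS_EMPLOYED': [-1000], 'DAYS_BIRTH': [-10000],
                     'AMT_CREDIT': [400000.0], 'AMT_INCOME_TOTAL': [200000.0],
                     'AMT_ANNUITY_x': [20000.0]})
print(add_ratio_features(demo)[['DAYS_EMPLOYED_PCT', 'CREDIT_INCOME_PCT']].iloc[0].tolist())
# → [0.1, 2.0]
```

Calling one function on both `df_final_merged` and `df_final_test` guarantees the train and test features are computed identically.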
df_final_merged.shape
(92288, 78)
df_final_test.shape
(48797, 77)
metrics.plot_roc_curve(clf_pipe_merged_rf, X_test_merged, y_test_merged)
<sklearn.metrics._plot.roc_curve.RocCurveDisplay at 0x7fdf2f921430>
datasets.keys()
dict_keys(['application_train', 'application_test', 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments', 'previous_application', 'POS_CASH_balance'])
datasets["application_train"].head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | ... | EMERGENCYSTATE_MODE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_16 | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 202500.0 | 406597.5 | 24700.5 | Unaccompanied | ... | No | 2.0 | 2.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 270000.0 | 1293502.5 | 35698.5 | Family | ... | No | 1.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 67500.0 | 135000.0 | 6750.0 | Unaccompanied | ... | NaN | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 135000.0 | 312682.5 | 29686.5 | Unaccompanied | ... | NaN | 2.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 121500.0 | 513000.0 | 21865.5 | Unaccompanied | ... | NaN | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
5 rows × 52 columns
len(datasets["application_train"]["SK_ID_CURR"].unique()) == datasets["application_train"].shape[0]
True
np.intersect1d(datasets["application_train"]["SK_ID_CURR"], datasets["application_test"]["SK_ID_CURR"])
array([], dtype=int64)
datasets["application_test"].shape
(48744, 51)
datasets["application_train"].shape
(307511, 52)
pred= clf_pipe.predict(datasets["application_test"])
pred_df = pd.DataFrame(pred)
pred_df.shape
(48744, 1)
pred_df.value_counts()
0 48649 1 95 dtype: int64
pred_proba = clf_pipe.predict_proba(datasets["application_test"])
pred_proba_df = pd.DataFrame(pred_proba)
pred_proba_df
| 0 | 1 | |
|---|---|---|
| 0 | 0.939066 | 0.060934 |
| 1 | 0.766156 | 0.233844 |
| 2 | 0.947208 | 0.052792 |
| 3 | 0.950071 | 0.049929 |
| 4 | 0.857568 | 0.142432 |
| ... | ... | ... |
| 48739 | 0.970325 | 0.029675 |
| 48740 | 0.946596 | 0.053404 |
| 48741 | 0.941403 | 0.058597 |
| 48742 | 0.928029 | 0.071971 |
| 48743 | 0.904105 | 0.095895 |
48744 rows × 2 columns
class_1_proba = pred_proba_df[[1]]
datasets['application_test'].head()
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | ... | EMERGENCYSTATE_MODE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_16 | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 135000.0 | 568800.0 | 20560.5 | Unaccompanied | Working | ... | No | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 99000.0 | 222768.0 | 17370.0 | Unaccompanied | Working | ... | NaN | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 202500.0 | 663264.0 | 69777.0 | NaN | Working | ... | NaN | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 315000.0 | 1575000.0 | 49018.5 | Unaccompanied | Working | ... | No | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 180000.0 | 625500.0 | 32067.0 | Unaccompanied | Working | ... | NaN | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 | NaN | NaN |
5 rows × 51 columns
final = pd.DataFrame()
final['SK_ID_CURR'] = datasets['application_test']['SK_ID_CURR']
final['TARGET'] = class_1_proba[1]
| SK_ID_CURR | TARGET | |
|---|---|---|
| 0 | 100001 | 0.060934 |
| 1 | 100005 | 0.233844 |
| 2 | 100013 | 0.052792 |
| 3 | 100028 | 0.049929 |
| 4 | 100038 | 0.142432 |
| ... | ... | ... |
| 48739 | 456221 | 0.029675 |
| 48740 | 456222 | 0.053404 |
| 48741 | 456223 | 0.058597 |
| 48742 | 456224 | 0.071971 |
| 48743 | 456250 | 0.095895 |
48744 rows × 2 columns
final = final.set_index('SK_ID_CURR')
final.to_csv('submission.csv')
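Before uploading, a quick sanity check that the file matches Kaggle's expected format (columns `SK_ID_CURR` and `TARGET`, one row per test applicant, probabilities in [0, 1]) can catch silent merge bugs; a hedged sketch (the helper name is ours):

```python
import pandas as pd

def check_submission(path: str, expected_rows: int = 48744) -> None:
    # Validate the submission file before calling the Kaggle CLI
    sub = pd.read_csv(path)
    assert list(sub.columns) == ['SK_ID_CURR', 'TARGET'], 'wrong columns'
    assert len(sub) == expected_rows, f'expected {expected_rows} rows, got {len(sub)}'
    assert sub['SK_ID_CURR'].is_unique, 'duplicate applicant IDs'
    assert sub['TARGET'].between(0, 1).all(), 'probabilities out of range'

# check_submission('submission.csv')  # run after writing the file
```

Running this before each `kaggle competitions submit` would have flagged the duplicated `SK_ID_CURR` rows handled later with `drop_duplicates`.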
pred2= gs.predict(df_final_test)
pred_df2 = pd.DataFrame(pred2)
pred_df2.shape
(48797, 1)
pred_df2.value_counts()
0 48740 1 57 dtype: int64
pred_proba_part2 = gs_rf.predict_proba(df_final_test)
pred_proba_df = pd.DataFrame(pred_proba_part2)
pred_proba_part3 = gs.predict_proba(df_final_test)
pred_proba_df_log = pd.DataFrame(pred_proba_part3)
pred_proba_df
| 0 | 1 | |
|---|---|---|
| 0 | 0.95 | 0.05 |
| 1 | 0.90 | 0.10 |
| 2 | 0.95 | 0.05 |
| 3 | 0.95 | 0.05 |
| 4 | 0.84 | 0.16 |
| ... | ... | ... |
| 48792 | 0.96 | 0.04 |
| 48793 | 0.86 | 0.14 |
| 48794 | 0.93 | 0.07 |
| 48795 | 0.94 | 0.06 |
| 48796 | 0.77 | 0.23 |
48797 rows × 2 columns
pred_proba_df_log
| 0 | 1 | |
|---|---|---|
| 0 | 0.937469 | 0.062531 |
| 1 | 0.826831 | 0.173169 |
| 2 | 0.960347 | 0.039653 |
| 3 | 0.965731 | 0.034269 |
| 4 | 0.805910 | 0.194090 |
| ... | ... | ... |
| 48792 | 0.973858 | 0.026142 |
| 48793 | 0.941609 | 0.058391 |
| 48794 | 0.967976 | 0.032024 |
| 48795 | 0.942264 | 0.057736 |
| 48796 | 0.878175 | 0.121825 |
48797 rows × 2 columns
class_1_proba_df = pred_proba_df[[1]]
class_1_proba_df_log = pred_proba_df_log[[1]]
df_final_test.head()
| SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY_x | NAME_TYPE_SUITE | NAME_INCOME_TYPE | ... | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | DAYS_EMPLOYED_PCT | CREDIT_INCOME_PCT | ANNUITY_INCOME_PCT | CREDIT_TERM_PCT | AMT_BALANCE_PCT | AVG_INCOME_EXT_PCT | AVG_TOTALINCOME_PCT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 135000.0 | 568800.0 | 20560.5 | Unaccompanied | Working | ... | NaN | NaN | NaN | 0.121044 | 4.213333 | 0.152300 | 0.036147 | NaN | 0.567263 | 237984.936457 |
| 1 | 100005 | Cash loans | M | N | Y | 99000.0 | 222768.0 | 17370.0 | Unaccompanied | Working | ... | NaN | NaN | NaN | 0.247398 | 2.250182 | 0.175455 | 0.077973 | NaN | 0.429869 | 230302.648536 |
| 2 | 100013 | Cash loans | M | Y | Y | 202500.0 | 663264.0 | 69777.0 | NaN | Working | ... | NaN | NaN | NaN | 0.222477 | 3.275378 | 0.344578 | 0.105202 | NaN | NaN | NaN |
| 3 | 100028 | Cash loans | F | N | Y | 315000.0 | 1575000.0 | 49018.5 | Unaccompanied | Working | ... | 0.0 | 2.613636 | 19.547619 | 0.133515 | 5.000000 | 0.155614 | 0.031123 | -4.4399 | 0.549372 | 573382.213537 |
| 4 | 100038 | Cash loans | M | Y | N | 180000.0 | 625500.0 | 32067.0 | Unaccompanied | Working | ... | NaN | NaN | NaN | 0.168021 | 3.475000 | 0.178150 | 0.051266 | NaN | NaN | NaN |
5 rows × 77 columns
finalp2 = pd.DataFrame()
finalp3 = pd.DataFrame()
finalp2['SK_ID_CURR'] = df_final_test['SK_ID_CURR']
finalp2['TARGET'] = class_1_proba_df[1]
finalp3['SK_ID_CURR'] = df_final_test['SK_ID_CURR']
finalp3['TARGET'] = class_1_proba_df_log[1]
finalp2
| SK_ID_CURR | TARGET | |
|---|---|---|
| 0 | 100001 | 0.05 |
| 1 | 100005 | 0.10 |
| 2 | 100013 | 0.05 |
| 3 | 100028 | 0.05 |
| 4 | 100038 | 0.16 |
| ... | ... | ... |
| 48792 | 456221 | 0.04 |
| 48793 | 456222 | 0.14 |
| 48794 | 456223 | 0.07 |
| 48795 | 456224 | 0.06 |
| 48796 | 456250 | 0.23 |
48797 rows × 2 columns
finalp3
| SK_ID_CURR | TARGET | |
|---|---|---|
| 0 | 100001 | 0.062531 |
| 1 | 100005 | 0.173169 |
| 2 | 100013 | 0.039653 |
| 3 | 100028 | 0.034269 |
| 4 | 100038 | 0.194090 |
| ... | ... | ... |
| 48792 | 456221 | 0.026142 |
| 48793 | 456222 | 0.058391 |
| 48794 | 456223 | 0.032024 |
| 48795 | 456224 | 0.057736 |
| 48796 | 456250 | 0.121825 |
48797 rows × 2 columns
finalp2 = finalp2.drop_duplicates(subset='SK_ID_CURR', keep="first")
finalp2
| SK_ID_CURR | TARGET | |
|---|---|---|
| 0 | 100001 | 0.05 |
| 1 | 100005 | 0.10 |
| 2 | 100013 | 0.05 |
| 3 | 100028 | 0.05 |
| 4 | 100038 | 0.16 |
| ... | ... | ... |
| 48792 | 456221 | 0.04 |
| 48793 | 456222 | 0.14 |
| 48794 | 456223 | 0.07 |
| 48795 | 456224 | 0.06 |
| 48796 | 456250 | 0.23 |
48744 rows × 2 columns
finalp3 = finalp3.drop_duplicates(subset='SK_ID_CURR', keep="first")
finalp3
| SK_ID_CURR | TARGET | |
|---|---|---|
| 0 | 100001 | 0.062531 |
| 1 | 100005 | 0.173169 |
| 2 | 100013 | 0.039653 |
| 3 | 100028 | 0.034269 |
| 4 | 100038 | 0.194090 |
| ... | ... | ... |
| 48792 | 456221 | 0.026142 |
| 48793 | 456222 | 0.058391 |
| 48794 | 456223 | 0.032024 |
| 48795 | 456224 | 0.057736 |
| 48796 | 456250 | 0.121825 |
48744 rows × 2 columns
finalp2 = finalp2.set_index('SK_ID_CURR')
finalp3 = finalp3.set_index('SK_ID_CURR')
finalp2.to_csv('submission_p2.csv')
finalp3.to_csv('submission_p3.csv')  # writing both to the same 'submission.csv' would overwrite the first file
! kaggle competitions submit -c home-credit-default-risk -f submission_p2.csv -m "baseline submission"
/bin/bash: kaggle: command not found
The course project is based on the Home Credit Default Risk (HCDR) competition (https://www.kaggle.com/c/home-credit-default-risk/). The objective is to offer loans to individuals who can pay them back and turn away those who cannot; in other words, to predict whether or not a client will repay a loan. In this phase of the project, we analyze and model the Home Credit default risk data currently hosted on Kaggle. The objective of this phase is to perform exploratory data analysis (EDA) on the data: describing it, calculating summary statistics to summarize its main characteristics, visualizing the results, and counting missing items for all the input variables. This step tells us what the data can reveal before we start modeling, and lets us discover patterns, spot irregularities and biases, and check assumptions with the help of summary statistics. The EDA process also involved dropping the columns with the most missing values. Subsequently, we built a correlation matrix to find and remove highly correlated variables, which would render our model inefficient. To avoid inaccuracy, we split the dataframe into train and test subsets. Next, we created pipelines to separately impute numerical and categorical values, and used one-hot encoding to transform the categorical variables. Finally, we used Logistic Regression and Random Forest to predict the target variable from the input variables.
The task at hand for this phase of the project is to:
- Provide diagrams to aid in understanding the workflow
Before deriving insights from the data, we need to understand it: this gives a better idea of the problem at hand, and further analysis exposes irregularities in the data that need to be corrected before building the model. Here, we look at the application_train.csv dataset.
Steps performed in this phase of project:
Identify the types of data available - Before proceeding with further analysis, we determine the datatypes of our variables. This uncovers issues that could surface later, for example variables whose datatypes look correct but essentially need to be represented as another datatype. datasets['application_train'].info(verbose=True) info() prints information about a DataFrame, including the index dtype, columns, non-null counts, and memory usage.
Evaluate basic statistical information about the data - To get an idea of the average value of the data in the dataset, we measure its central tendencies. datasets["application_train"].describe(include='all') describe() prints descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution.
Finding the number of numerical and categorical variables - It is important to treat numerical and categorical variables differently. Categorical features have a lot to say about the dataset, so they should be converted to numerical form to make them machine-readable.
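As a sketch of this step, pandas' select_dtypes separates the two kinds of columns in one call; the mini-frame below is illustrative, not the real application_train schema:

```python
import pandas as pd

# Hypothetical mini-frame standing in for application_train.
df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [135000.0, 99000.0, 202500.0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Cash loans", "Revolving loans"],
    "CODE_GENDER": ["F", "M", "M"],
})

# select_dtypes splits the columns by kind.
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="object").columns.tolist()
```

Here numerical_cols holds the one numeric column and categorical_cols the two object columns, which is exactly the split the pipelines below need.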
Check and determine missing data
Before we can use data with missing data fields, we need to transform these fields to be used for analysis and modelling. It is very important to handle missing data either by deleting or through imputation (handling the missing values with some estimation).
We determine the number of missing values in every column and sort them in descending order to see which features have the most missing values. We then transform our dataframe to keep only the columns which have more than 50 percent non-missing values. Next, we impute the remaining missing data with the median for numerical variables and the most frequent value for categorical variables.
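The steps above can be sketched as follows; the column names are hypothetical stand-ins for the real HCDR features:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (column names are illustrative, not the real HCDR schema).
df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "amt_annuity":    [20560.5, np.nan, 69777.0, 49018.5],
    "occupation":     ["Working", np.nan, "Working", "Pensioner"],
})

# 1. Rank columns by missing count, highest first.
missing_counts = df.isnull().sum().sort_values(ascending=False)

# 2. Keep only columns that are more than 50% non-missing.
df = df.loc[:, df.notnull().mean() > 0.5]

# 3. Impute: median for numerical, most frequent for categorical.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(include="object").columns
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])
```

The 75%-missing column is dropped in step 2, while the surviving columns are filled with the median and the mode respectively.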
Visualization through correlation matrix
A correlation matrix is used to examine the relationship between multiple variables at the same time. The calculation produces a table containing the correlation coefficients between each variable and the others; the coefficients show us both the strength of each relationship and its direction (positive or negative correlation). In Python, a correlation matrix can be created using packages such as Pandas and NumPy.
A correlation matrix is useful:
- When we have a big dataset and intend to explore patterns.
- As input to other statistical methods; for instance, correlation matrices can be used as data when conducting exploratory factor analysis, confirmatory factor analysis, or structural equation models.
- As a diagnostic when checking assumptions, e.g. for regression analysis.
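A minimal sketch of building a correlation matrix and dropping one column from each highly correlated pair; the feature names and the 0.9 threshold are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "feat_a": a,
    "feat_b": a * 2.0 + rng.normal(scale=0.01, size=200),  # near-duplicate of feat_a
    "feat_c": rng.normal(size=200),
})

corr = df.corr().abs()
# Look only at the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)
```

Only one column of the near-duplicate pair is removed, which is the behavior we want: keep one representative of each correlated group.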
Above, we have plotted the correlation of all the features with the target variable. As can be seen, approximately 30% of the features are not at all correlated with the target variable. We can drop them, as we do not believe they will influence our predictions of the target variable. Next, we drop the redundant variables that are highly correlated with one another.
Conclusions of the Correlation Matrix - Since many feature variables are related to each other, we deal with them by dropping the repetitive features, i.e. deleting all but one of each group of highly correlated columns. Thus, as seen above, we have removed the columns that are related to each other, leaving only one of each group.
VISUAL EXPLORATORY DATA ANALYSIS
Above, we have plotted a graph describing the loan applicants' family members. As we can see, hardly any applicants have more than 5 family members.
Who is the highest borrower?
As we can see, females are the highest borrowers.
Pairplot showing the variables most highly correlated with the target.
In this section of the project, the following steps have been performed: Standardizing the data - the numerical variables have been standardized. Handling missing values - following the good practice of imputing missing values, we impute the numerical as well as categorical variables as follows. All the work has been performed in pipelines: the numerical variables have been imputed with the median and the categorical variables with the most frequent value.
Handling categorical variables - We have used one of the most common and efficient ways to handle this transformation, one-hot encoding. After the transformation, each column of the resulting dataset corresponds to one unique value of each original feature. We implement the one-hot encoding in such a way that the resulting sets are suitable for use in machine learning algorithms.
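A sketch of the preprocessing described above, using scikit-learn's Pipeline and ColumnTransformer; the column lists and the tiny frame are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["AMT_INCOME_TOTAL", "AMT_CREDIT"]     # illustrative numeric columns
cat_cols = ["NAME_CONTRACT_TYPE", "CODE_GENDER"]  # illustrative categorical columns

# Median-impute then standardize the numerical variables.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Most-frequent-impute then one-hot encode the categorical variables.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_pipe, num_cols),
    ("cat", categorical_pipe, cat_cols),
])

# Tiny made-up frame to show the transform end to end.
X = pd.DataFrame({
    "AMT_INCOME_TOTAL": [135000.0, np.nan, 202500.0],
    "AMT_CREDIT": [568800.0, 222768.0, np.nan],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Cash loans", np.nan],
    "CODE_GENDER": ["F", "M", "M"],
})
Xt = preprocessor.fit_transform(X)  # 2 scaled numeric cols + 3 one-hot cols
```

handle_unknown="ignore" keeps the transform from failing when the test set contains a category never seen during training.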
Model training: Now that we have explored the data, cleaned it, preprocessed it, and added a new feature to it, we can start the modeling part of the project by applying machine learning algorithms. In this section, we have a baseline logistic regression model and a random forest model. In the end, we will compare the performance of later models against these baselines.
Baseline Logistic Regression
Logistic Regression is a powerful algorithm for classification problems that fits models for categorical data, especially binary classification problems. Since our target (dependent) variable is categorical, logistic regression can directly predict the probability that a customer is creditworthy (able to meet a financial obligation in a timely manner) or not, using a number of predictors.
Loss function used (data loss and regularization parts):

$$J(\mathbf{w}) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right] + \lambda \lVert \mathbf{w} \rVert_2^2$$

where $\hat{p}_i = \sigma(\mathbf{w}^\top \mathbf{x}_i)$ is the predicted probability of default; the first term is the data loss (binary cross-entropy) and the second is the L2 regularization penalty.
Accuracy and AUC/ROC - Accuracy: Accuracy represents the number of correctly classified data instances over the total number of data instances. Below are the accuracy and the AUC/ROC of our logistic regression model.
Baseline Random Forest
Random Forest is one of the most popular algorithms; it builds a large number of decision trees from the training dataset. Each decision tree produces a class prediction, the method collects the votes from these trees, and the class with the most votes is the final prediction. Random forests can outperform linear models because they capture non-linear relationships between the features and the target.
Loss function used - a random forest has no single regularized loss; each tree greedily minimizes the Gini impurity at every split:

$$G = \sum_{k=1}^{K} p_k (1 - p_k)$$

where $p_k$ is the fraction of samples of class $k$ at a node, and splits are chosen to minimize the weighted impurity of the child nodes.
Accuracy and AUC/ROC - Accuracy: Accuracy represents the number of correctly classified data instances over the total number of data instances. ROC/AUC: An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: the true positive rate and the false positive rate. AUC provides an aggregate measure of performance across all possible classification thresholds; one way of interpreting it is as the probability that the model ranks a random positive example more highly than a random negative example.
Below are the accuracy and the AUC/ROC of our random forest model.
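For reference, both metrics can be computed with scikit-learn; the probabilities below are made-up model outputs, not values from our models:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_true  = [0, 0, 1, 0, 1, 0]
y_proba = [0.10, 0.30, 0.80, 0.20, 0.55, 0.60]  # hypothetical P(default)
y_pred  = [1 if p >= 0.5 else 0 for p in y_proba]

acc = accuracy_score(y_true, y_pred)   # depends on the 0.5 threshold
auc = roc_auc_score(y_true, y_proba)   # threshold-independent
```

Note that accuracy needs hard labels (and hence a threshold), while ROC AUC is computed directly from the predicted probabilities.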
Number of experiments conducted - We have selected two models for prediction: Logistic Regression and Random Forest Classifier.
In this phase of the project, our aim is to build on our observations from Phase 1 and improve the overall accuracy of our model. The Phase 1 model achieved an accuracy of ~73% on Kaggle; to address its shortcomings, we attempt to improve the quality of our dataset, and consequently our model, by performing feature engineering, merging all the datasets available in the Home Credit Default Risk data, and tuning hyperparameters with GridSearchCV. The first part of Phase 2 deals with merging all the available datasets on the basis of the primary key (or combination of keys) in each dataset. To check on the quality of our dataset as we go, we perform exploratory data analysis (heatmaps, correlation matrices, and missing-value visualization) on each dataset to better understand its important features, removing insignificant features on the basis of missing value counts and high correlation. The second part of Phase 2 uses the merged dataset to build additional features that we believe can improve the predictive power of our model. These domain-knowledge features prove successful in improving the quality of the model, as they are highly correlated with the target variable. The third part of Phase 2 revolves around building pipelines, tuning the hyperparameters with GridSearchCV, and incorporating the most significant features in our model. We then predict on our final dataset using Logistic Regression and Random Forest, improving the accuracy from 91.85% to 92.1% locally, and to ~74% on Kaggle.
The task at hand for this phase of the project is to:
The better the features that we prepare and choose from the dataset, the better the results; this is why we perform feature engineering. We need good features that describe the structures inherent in the data. The main goal of feature engineering is to get the best results from our algorithms.
The following is the initial Feature Engineering performed on the dataset in Phase 1 :
In this phase, in order to improve the accuracy of our model and to utilize all the features and data from all the datasets, we first merge the datasets together.
Let's look at each dataset individually and merge them. The engineered features are explained in more detail below in the notebook.
This dataset contains all clients' previous credits provided by other financial institutions that were reported to the Credit Bureau (for clients who have a loan in our sample). For every loan in our sample, there are as many rows as the number of credits the client had in the Credit Bureau before the application date. One interesting feature is 'CREDIT_ACTIVE', which shows the status of each credit.
We merge the Bureau and Bureau Balance datasets on SK_ID_BUREAU.
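A minimal sketch of this merge, with toy stand-ins for bureau.csv and bureau_balance.csv:

```python
import pandas as pd

# Toy stand-ins; the real tables come from bureau.csv and bureau_balance.csv.
bureau = pd.DataFrame({
    "SK_ID_BUREAU": [1, 2],
    "SK_ID_CURR": [100001, 100001],
    "CREDIT_ACTIVE": ["Closed", "Active"],
})
bureau_balance = pd.DataFrame({
    "SK_ID_BUREAU": [1, 1, 2],
    "MONTHS_BALANCE": [-2, -1, -1],
})

# bureau_balance is monthly, so aggregate to one row per SK_ID_BUREAU first,
# then left-join onto bureau.
bb_agg = bureau_balance.groupby("SK_ID_BUREAU").mean().reset_index()
df_bureau = bureau.merge(bb_agg, on="SK_ID_BUREAU", how="left")
```

Aggregating before the join keeps df_bureau at one row per bureau credit instead of exploding it to one row per month.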
Here, we create a test dataframe which contains only the target with SK_ID_CURR, so that we can check the correlation with the merged bureau columns and drop irrelevant ones. Since this dataset has not been preprocessed, we perform EDA on the merged dataset, dropping missing values and highly correlated columns.
From the missingno graph, we see that 'AMT_CREDIT_SUM_LIMIT' has a lot of missing values, so we impute this column by replacing null values with 0.
The merged dataset of bureau and bureau balance should contribute only features that are important to the final dataset; hence, we perform feature selection to keep only the most important features.
We merge our df_bureau_final with application_train on SK_ID_CURR.
This dataset consists of monthly data about previous point of sale or cash loans clients have had with Home Credit. Each row is one month of a previous point of sale or cash loan, and a single previous loan can have many rows.
We perform EDA on this dataset by finding out the missing values count and removing the irrelevant features as derived from the correlation matrix.
We merge our df_POS with application_train on SK_ID_CURR
This dataset contains the monthly data about previous credit cards clients have had with Home Credit. Each row is one month of a credit card balance, and a single credit card can have many rows.
We perform EDA on this dataset by finding the missing value counts and removing the irrelevant features which have no or insignificant correlation with the target variable.
We also group by SK_ID_CURR and SK_ID_PREV, taking the mean, so that we retain as much information as we can while merging or joining this dataset.
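The grouping step can be sketched as follows, with toy rows standing in for the monthly credit-card balance data:

```python
import pandas as pd

# Toy monthly credit-card rows; the real data has one row per month per card.
ccb = pd.DataFrame({
    "SK_ID_CURR":  [100001, 100001, 100001, 100005],
    "SK_ID_PREV":  [200001, 200001, 200002, 200003],
    "AMT_BALANCE": [1000.0, 3000.0, 500.0, 0.0],
})

# Collapse the monthly rows to one row per (SK_ID_CURR, SK_ID_PREV),
# keeping the mean so information survives the later joins.
ccb_agg = ccb.groupby(["SK_ID_CURR", "SK_ID_PREV"], as_index=False).mean()
```

Averaging over months is one reasonable summary; min/max/sum aggregates are equally possible if more detail is wanted.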
Next, we will work on Installments Payments first and then join everything with Application train after joining installments payments and Credit Card Balance.
This dataset contains the data of payment history for previous loans at Home Credit. There is one row for every made payment and one row for every missed payment.
We perform EDA on this dataset by finding the missing value counts and removing the irrelevant features which have no or insignificant correlation with the target variable.
We group by SK_ID_CURR and SK_ID_PREV, taking the mean, so that we retain as much information as we can while merging or joining the datasets.
We then perform EDA on and preprocess the merged dataset of credit card balance and installment payments. We do this even though each individual table has already been processed, because after merging there might be columns that are insignificant or highly correlated with each other.
This dataset contains the data of previous applications for loans at Home Credit of clients who have loans in the application data. Each current loan in the application data can have multiple previous loans. Each previous application has one row and is identified by the feature SK_ID_PREV.
We remove the columns with high correlation and those containing a large percentage of NULL values while performing EDA and preprocessing this dataset.
We group by SK_ID_CURR and SK_ID_PREV, taking the mean, so that we retain as much information as we can while merging or joining.
After merging all the preprocessed datasets, we perform EDA on and preprocess the merged dataset as well, because there might be columns that are now insignificant or highly correlated with each other.
We merge df_PREV_CCBALINST with the already merged Application_train and Bureau data, left-joining the datasets on SK_ID_CURR.
We then perform EDA on the final merged dataset, because there might be columns that are now insignificant or highly correlated with each other.
While performing EDA and with our domain knowledge, we found out some existing features whose combination can have strong explanatory/predictive power.
Here is the description of the additional features:
Percentage of days employed - How long a person has been employed, as a percentage of his life, is a strong predictor of his ability to keep paying off his loans.
We chose this feature because the total number of days a person has been employed in his lifetime affects his credit-paying ability, and since it is highly correlated with the target variable we engineered a new feature out of it.
Available credit as a percentage of income - If a person has a very large amount of credit available as a percentage of income, this can impact his ability to pay off his loans: a person with a very high credit limit relative to his earnings will often be unable to pay the loan back.
Annuity as a percentage of income - If a person receives an annuity, this is a more stable source of income; thus, if it is higher, the person is less likely to default.
We chose this feature for the reason mentioned above, and because it is highly correlated with the target variable.
Annuity as a percentage of available credit - If a person receives an annuity, this is a more stable source of income; thus, if it is a high percentage relative to his/her available credit, the person is more likely to be able to pay off his debts.
Amount balance as a percentage - The remaining balance in the account is an important feature, as it tells us much about a person's ability to pay back his debt. We took this feature into account also because of its correlation with the target variable.
Average of the three external sources of income - Since these three features have the highest correlation with the target variable among all features, we took their average and created a new feature that tells us how they affect the prediction of the target variable. This feature has a correlation of almost 13%.
Total income average as a percentage - This is the best feature that we generated out of all seven newly made features. It shows total income as a percentage of the person's earnings relative to his external income. We engineered it using feature 6 above, and it is almost 22% negatively correlated with the target variable.
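The clearly specified ratios above can be sketched directly in pandas. The one-row frame below is made up (the column names are the real HCDR ones); the balance-percentage and AVG_TOTALINCOME_PCT features are omitted because their exact formulas are not shown in the text:

```python
import pandas as pd

# A single illustrative applicant with the real HCDR column names.
df = pd.DataFrame({
    "DAYS_EMPLOYED": [-2000.0], "DAYS_BIRTH": [-10000.0],
    "AMT_CREDIT": [500000.0], "AMT_INCOME_TOTAL": [100000.0],
    "AMT_ANNUITY": [25000.0],
    "EXT_SOURCE_1": [0.3], "EXT_SOURCE_2": [0.5], "EXT_SOURCE_3": [0.4],
})

# 1. Percentage of days employed (both counts are negative day offsets).
df["DAYS_EMPLOYED_PCT"] = df["DAYS_EMPLOYED"] / df["DAYS_BIRTH"]
# 2. Available credit as a percentage of income.
df["CREDIT_INCOME_PCT"] = df["AMT_CREDIT"] / df["AMT_INCOME_TOTAL"]
# 3. Annuity as a percentage of income.
df["ANNUITY_INCOME_PCT"] = df["AMT_ANNUITY"] / df["AMT_INCOME_TOTAL"]
# 4. Annuity as a percentage of available credit.
df["ANNUITY_CREDIT_PCT"] = df["AMT_ANNUITY"] / df["AMT_CREDIT"]
# 6. Average of the three external source scores.
df["EXT_SOURCE_AVG"] = df[["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]].mean(axis=1)
```

Each new column is a simple elementwise ratio, so these lines scale unchanged to the full merged dataset.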
After adding these additional features, we analyze the expert features and find that AVG_TOTALINCOME_PCT and DAYS_EMPLOYED_PCT rank highly, as measured by correlation with TARGET.
Hyperparameters govern the training process itself. Hyperparameter tuning works by running multiple trials in a single training job. In this part of the phase, we fit our models with GridSearchCV and store the results.
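A sketch of tuning a pipeline with GridSearchCV on synthetic data; the grid of C values is illustrative, not the grid used in the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed HCDR feature matrix.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}  # illustrative grid

# Each candidate is cross-validated; the best AUC wins.
search = GridSearchCV(pipe, param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)
best_C = search.best_params_["clf__C"]
```

Putting the scaler inside the pipeline matters: GridSearchCV then refits the scaler within each fold, so the validation folds never leak into the scaling statistics.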
Modeling Pipelines (HCDR)
As in Phase 1, the numerical variables are standardized and missing values are imputed within pipelines: the median for numerical variables and the most frequent value for categorical variables.
As before, categorical variables are transformed with one-hot encoding, so that each column of the resulting dataset corresponds to one unique value of each original feature and the result is suitable for machine learning algorithms.
Now that we have explored the data, preprocessed it, and performed feature engineering along with hyperparameter tuning, we can start the modeling part of the project by applying machine learning algorithms. In this section, we have our upgraded logistic regression model and a random forest model with GridSearchCV. In the end, we will compare the performance of the models.
As in Phase 1, we use logistic regression to directly predict the probability that a customer is creditworthy from a number of predictors.
The validation accuracy of the Logistic Regression model was 0.9198721421605808.
Loss function used (data loss and regularization parts) - the regularized log loss, as in the baseline:

$$J(\mathbf{w}) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right] + \lambda \lVert \mathbf{w} \rVert_2^2$$

The log loss of the model was 2.7675196811129577.
Accuracy: Accuracy represents the number of correctly classified data instances over the total number of data instances. Below are the accuracy and the AUC/ROC of our logistic regression model.
The ROC AUC of the model was 0.738818586860982.
As in Phase 1, the random forest aggregates votes from a large number of decision trees, capturing non-linear relationships that linear models miss.
Experiment Table with all the details of the experiment:
Accuracy and ROC AUC are computed as described for the baseline models above.
The validation accuracy of Random Forest Model was 0.9209556831726081
Loss function used - the log loss of the model was 2.730093984189766.
The ROC AUC of the model was 0.7241955610236327.
Initially, we tried merging all the different datasets and created several new columns to understand the relation with target column and gain fruitful insights from the dataset. On analysis, we found out that out of the new columns, a few of them were highly correlated with the target variable, thus helping us improve the model predictions. We also tried using synthetic sampling techniques like SMOTE and ADASYN to check if there was any change in the model performance.
Finally, we created a pipeline to perform Logistic Regression and Random Forest classification, using hyperparameter tuning to gather the best features. We then ran our model with the best parameters, found an increase in model accuracy, and submitted the file on Kaggle.
For Phase II of this project, we built Logistic Regression and Random Forest models, performed hyperparameter tuning using GridSearchCV, and achieved an accuracy of over 92.1% with Random Forest and 92% with Logistic Regression.
The next steps in our project would be to improve our model performance through implementation of Neural Networks.
Please provide a screenshot of your best kaggle submission.
The screenshot should show the different details of the submission and not just the score.
https://github.iu.edu/jshanah/I526_AML_Student/tree/master/Labs/Labs-10.1-Decision-Trees
https://machinelearningmastery.com/random-forest-ensemble-in-python/
https://blog.citizennet.com/blog/2012/11/10/random-forests-ensembles-and-performance-metrics
https://builtin.com/data-science/random-forest-algorithm
https://towardsdatascience.com/logistic-regression-detailed-overview-46c4da4303bc
https://github.com/rakshithvasudev/Home-Credit-Default-Risk
https://randlow.github.io/posts/machine-learning/kaggle-home-loan-credit-risk-feat-eng-p3/